System and method for executing a game process

ABSTRACT

A 3-D imaging system for recognition and interpretation of gestures to control a computer. The system includes a 3-D imaging system that performs gesture recognition and interpretation based on a previous mapping of a plurality of hand poses and orientations to user commands for a given user. When the user is identified to the system, the imaging system images gestures presented by the user, performs a lookup for the user command associated with the captured image(s), and executes the user command(s) to effect control of the computer, programs, and connected devices.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of pending U.S. patent application Ser. No. 10/396,653 entitled “ARCHITECTURE FOR CONTROLLING A COMPUTER USING HAND GESTURES”, which was filed Mar. 25, 2003, the entirety of which is incorporated by reference.

TECHNICAL FIELD

The present invention relates generally to controlling a computer system, and more particularly to a system and method to implement alternative modalities for controlling computer programs and devices, and manipulating on-screen objects through the use of one or more body gestures, or a combination of gestures and supplementary signals.

BACKGROUND OF THE INVENTION

A user interface facilitates the interaction between a computer and computer user by enhancing the user's ability to utilize application programs. The traditional interface between a human user and a typical personal computer is implemented with graphical displays and is generally referred to as a graphical user interface (GUI). Input to the computer or particular application program is accomplished through the presentation of graphical information on the computer screen and through the use of a keyboard and/or mouse, trackball, or other similar implements. Many systems employed for use in public areas utilize touch screen implementations whereby the user touches a designated area of a screen to effect the desired input. Airport electronic ticket check-in kiosks and rental car direction systems are examples of such systems. There are, however, many applications where the traditional user interface is less practical or efficient.

The traditional computer interface is not ideal for a number of applications. Providing stand-up presentations or other types of visual presentations to large audiences is but one example. In this example, a presenter generally stands in front of the audience and provides a verbal dialog in conjunction with the visual presentation that is projected on a large display or screen. Manipulation of the presentation by the presenter is generally controlled through use of awkward remote controls, which frequently suffer from inconsistent and less precise operation, or require the cooperation of another individual. Traditional user interfaces require the user either to provide input via the keyboard or to exhibit a degree of skill and precision more difficult to implement with a remote control than with a traditional mouse and keyboard. Other examples include control of video, audio, and display components of a media room. Switching between sources, fast-forwarding, rewinding, changing chapters, changing volume, etc., can be very cumbersome in a professional studio as well as in the home. Similarly, traditional interfaces are not well suited for smaller, specialized electronic gadgets.

Additionally, people with motion impairment conditions find it very challenging to cope with traditional user interfaces and computer access systems. Such conditions include Cerebral Palsy, Muscular Dystrophy, Friedrich's Ataxia, and spinal injuries or disorders. These conditions and disorders are often accompanied by tremors, spasms, loss of coordination, restricted range of movement, reduced muscle strength, and other motion impairing symptoms.

Similar symptoms exist in the growing elderly segment of the population. It is known that as people age, their cognitive, perceptual, and motor skills decline, with negative effects on their ability to perform many tasks. The requirement to position a cursor, particularly with smaller graphical presentations, can often be a significant barrier for elderly or afflicted computer users. Computers can play an increasingly important role in helping older adults function well in society.

Graphical interfaces contribute to the ease of use of computers. WIMP (Window, Icon, Menu, Pointing device (or Pull-down menu)) interfaces allow fairly non-trivial operations to be performed with a few mouse motions and clicks. However, at the same time, this shift in the user interaction from a primarily text-oriented experience to a point-and-click experience has erected new barriers between people with disabilities and the computer. For example, for older adults, there is evidence that using the mouse can be quite challenging. There is extensive literature demonstrating that the ability to make small movements decreases with age. This decreased ability can have a major effect on the ability of older adults to use a pointing device on a computer. It has been shown that even experienced older computer users move a cursor much more slowly and less accurately than their younger counterparts. In addition, older adults seem to have increased difficulty (as compared to younger users) when targets become smaller. For older computer users, positioning a cursor can be a severe limitation.

One solution to the problem of decreased ability to position the cursor with a mouse is to simply increase the size of the targets in computer displays, which can often be counter-productive since less information is being displayed, requiring more navigation. Another approach is to constrain the movement of the mouse to follow on-screen objects, as with sticky icons or solid borders that do not allow cursors to overshoot the target. There is evidence that performance with area cursors (possibly translucent) is better than performance with regular cursors for some target acquisition tasks.

One method to facilitate computer access for users with motion impairment conditions, and for applications in which the traditional user interfaces are cumbersome, is through use of perceptual user interfaces. Perceptual user interfaces utilize alternate sensing modalities, such as the capability of sensing physical gestures of the user, to replace or complement traditional input devices such as the mouse and keyboard. Perceptual user interfaces promise modes of fluid computer-human interaction that complement and/or replace the mouse and keyboard, particularly in non-desktop applications such as control for a media room.

One study indicates that adding a simple gesture-based navigation facility to web browsers can significantly reduce the time taken to carry out one of the most common actions in computer use, i.e., using the “back” button (or function) to return to previously visited pages. Subjective ratings by users in experiments showed a strong preference for a “flick” system, where the users would flick the mouse left or right to go back or forward in the web browser.

In the simplest view, gestures play a symbolic communication role similar to speech, suggesting that for simple tasks gestures can enhance or replace speech recognition. Small gestures near the keyboard or mouse do not induce fatigue as quickly as sustained whole arm postures. Previous studies indicate that users find gesture-based systems highly desirable, but that users are also dissatisfied with the recognition accuracy of gesture recognizers. Furthermore, experimental results indicate that a user's difficulty with gestures is in part due to a lack of understanding of how gesture recognition works. The studies highlight the ability of users to learn and remember gestures as an important design consideration.

Even when a mouse and keyboard are available, users may find it attractive to manipulate often-used applications while away from the keyboard, in what can be called a “casual interface” or “lean-back” posture. Browsing e-mail over morning coffee might be accomplished by mapping simple gestures to “next message” and “delete message”.

Gestures can compensate for the limitations of the mouse when the display is several times larger than a typical display. In such a scenario, gestures can provide mechanisms to restore the ability to quickly reach any part of the display, where once a mouse was adequate with a small display. Similarly, in a multiple display scenario it is desirable to have a fast, comfortable way to indicate a particular display. For example, the foreground object can be “bumped” to another display by gesturing in the direction of the target display.

However, examples of perceptual user interfaces to date are dependent on significant limiting assumptions. One type of perceptual user interface utilizes color models that make certain assumptions about the color of an object. Proper operation of the system is dependent on proper lighting conditions and can be negatively impacted when the system is moved from one location to another as a result of changes in lighting conditions, or simply when the lighting conditions change in the room. Factors that impact performance include sunlight versus artificial light, fluorescent light versus incandescent light, direct illumination versus indirect illumination, and the like. Additionally, most attempts to develop perceptual user interfaces require the user to wear specialized devices such as gloves, headsets, or close-talk microphones. The use of such devices is generally found to be distracting and intrusive for the user.

Thus perceptual user interfaces have been slow to emerge. The reasons include heavy computational burdens, unreasonable calibration demands, required use of intrusive and distracting devices, and a general lack of robustness outside of specific laboratory conditions. For these and similar reasons, there has been little advancement in systems and methods for exploiting perceptual user interfaces. However, as the trend towards smaller, specialized electronic gadgets continues to grow, so does the need for alternate methods for interaction between the user and the electronic device. Many of these specialized devices are too small, and their applications too unsophisticated, to utilize the traditional keyboard and mouse input devices. Examples of such devices include Tablet PCs, Media Center PCs, kiosks, handheld computers, home appliances, video games, and wall-sized displays, along with many others. In these and other applications, the perceptual user interface provides a significant advancement in computer control over traditional computer interaction modalities.

In light of these findings, what is needed is to standardize a small set of easily learned gestures, the semantics of which are determined by application context. A small set of very simple gestures can offer significant bits of functionality where they are needed most. For example, dismissing a notification window can be accomplished by a quick gesture to one side or the other, as in shooing a fly. Another example is gestures for “next” and “back” functionality found in web browsers, presentation programs (e.g., PowerPoint™), and other applications. Note that in many cases the surface forms of these various gestures can remain the same throughout these examples, while the semantics of the gestures depend on the application at hand. Providing a small set of standard gestures eases problems users have in recalling how gestures are performed, and also allows for simpler and more robust signal processing and recognition processes.

SUMMARY OF THE INVENTION

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is intended to neither identify key or critical elements of the invention nor delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

The present invention disclosed and claimed herein, in one aspect thereof, comprises a system for controlling a computer using gestures. The system includes a 3-D imaging system that performs gesture recognition and interpretation based on a previous mapping of a plurality of hand poses and orientations to user commands for a given user. When the user is identified to the system, the imaging system images gestures presented by the user, performs a lookup for the user command associated with the captured image(s), and executes the user command(s) to effect control of the computer, programs, and connected devices.

In another aspect of the present invention, the system includes a wireless device worn by the person. The wireless device includes one or more sensors that measure at least velocity, acceleration, and orientation of the device. The corresponding signals are transmitted to a computer system, processed, and interpreted to determine an object at which the device is pointed and the action to be taken on the object. Once the signals have been interpreted, the computer is controlled to interact with the object, which object can be a device and/or system connected to the computer, and software running on the computer. In one application, the wireless device is used in a medical environment and worn on the head of a medical person, allowing free use of the hands. Head movements facilitate control of the computer. In another multimodal approach, the person can also wear a wireless microphone to communicate voice signals to the computer separately or in combination with head movements for control thereof.

In yet another aspect of the present invention, a multimodal approach can be employed such that a person uses the wireless device in combination with the imaging capabilities of the 3-D imaging system.

In still another aspect of the present invention, the multimodal approach includes any combination of the 3-D imaging system, the wireless device, and vocalization to control the computer system and the hardware and software associated therewith. This approach finds application in a medical environment such as an operating room, for example.

In another aspect of the present invention, an engagement volume is employed in a medical environment such that one or both hands of the medical person are free to engage the volume and control the computer system during, for example, a patient operation. The volume is defined in space over the part of the patient undergoing the operation, and the hands of the medical person are used in the form of gestures to control the system for the presentation of medical information.

In accordance with another aspect thereof, the present invention facilitates adapting the system to the particular preferences of an individual user. The system and method allow the user to tailor the system to recognize specific hand gestures and verbal commands and to associate these hand gestures and verbal commands with particular actions to be taken. This capability gives different users, who may prefer to make different motions for a given command, the ability to tailor the system in a way most efficient for their personal use. Similarly, different users can choose to use different verbal commands to perform the same function.

In still another aspect of the present invention, the system employs a learning capability such that nuances of a user can be learned by the system and adapted to the user profile of gestures, vocalizations, etc.

The following description and the annexed drawings set forth in detail certain illustrative aspects of the invention. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention can be employed, and the present invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a system block diagram of components of the present invention for controlling a computer and/or other hardware/software peripherals interfaced thereto.

FIG. 2 illustrates a schematic block diagram of a perceptual user interface system, in accordance with an aspect of the present invention.

FIG. 3 illustrates a flow diagram of a methodology for implementing a perceptual user interface system, in accordance with an aspect of the present invention.

FIG. 4 illustrates a flow diagram of a methodology for determining the presence of moving objects within images, in accordance with an aspect of the present invention.

FIG. 5 illustrates a flow diagram of a methodology for tracking a moving object within an image, in accordance with an aspect of the present invention.

FIG. 6 illustrates a disparity between two video images captured by two video cameras mounted substantially parallel to each other for the purpose of determining the depth of objects, in accordance with an aspect of the present invention.

FIG. 7 illustrates an example of the hand gestures that the system can recognize and the visual feedback provided through the display, in accordance with an aspect of the present invention.

FIG. 8 illustrates an alternative embodiment wherein a unique icon is displayed in association with a name of a specific recognized command, in accordance with an aspect of the present invention.

FIGS. 9A and 9B illustrate an engagement plane and volume of both single and multiple monitor implementations, in accordance with an aspect of the present invention.

FIG. 10 illustrates a briefing room environment where gestures are utilized to control a screen projector via a computer system configured in accordance with an aspect of the present invention.

FIG. 11 illustrates a block diagram of a computer system operable to execute the present invention.

FIG. 12 illustrates a network implementation of the present invention.

FIG. 13 illustrates a medical operating room system that uses the engagement volume in accordance with the present invention.

FIG. 14 illustrates a medical operating room environment in which a computer control system with a wireless control device is employed in accordance with the present invention.

FIG. 15 illustrates a flowchart of a process from the perspective of the person for using the system of FIG. 14.

FIG. 16 illustrates a flowchart of a process from the perspective of the system of FIG. 14.

FIG. 17 illustrates a medical environment in which a 3-D imaging computer control system is employed to process hand (or body) gestures in accordance with the present invention.

FIG. 18 illustrates a flowchart of a process from the perspective of the person for using the system of FIG. 17.

FIG. 19 illustrates a flowchart of a process from the perspective of the system of FIG. 17.

FIG. 20 illustrates a medical environment in which a 3-D imaging computer control system is employed with the remote control device to process hand (or body) gestures and control the system in accordance with the present invention.

FIG. 21A illustrates sample one-handed and two-handed gestures that can be used to control the operation of the computing system in accordance with the present invention.

FIG. 21B illustrates additional sample one-handed gestures and sequenced one-handed gestures that can be used to control the operation of the computing system in accordance with the present invention.

FIG. 21C illustrates additional sample one-handed gestures that can be used to control the operation of the computing system in accordance with the present invention.

FIG. 21D illustrates additional sample one-handed gestures used in combination with voice commands that can be used to control the operation of the computing system in accordance with the present invention.

FIG. 21E illustrates additional sample one-handed gestures used in combination with voice commands and gaze signals that can be used to control the operation of the computing system in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

As used in this application, the terms “component” and “system” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.

The present invention relates to a system and methodology for implementing a perceptual user interface comprising alternative modalities for controlling computer programs and manipulating on-screen objects through hand gestures or a combination of hand gestures and/or verbal commands. A perceptual user interface system is provided that tracks hand movements and provides for the control of computer programs and manipulation of on-screen objects in response to hand gestures performed by the user. Similarly, the system provides for the control of computer programs and manipulation of on-screen objects in response to verbal commands spoken by the user. Further, the gestures and/or verbal commands can be tailored by a particular user to suit that user's personal preferences. The system operates in real time and is robust, lightweight, and responsive. The system provides a relatively inexpensive capability for the recognition of hand gestures and verbal commands.

Referring now to FIG. 1, there is illustrated a system block diagram of components of the present invention for controlling a computer and/or other hardware/software peripherals interfaced thereto. The system 100 includes a tracking component 102 for detecting and tracking one or more objects 104 through image capture utilizing cameras (not shown) or other suitable conventional image-capture devices. The cameras operate to capture images of the object(s) 104 in a scene within the image capture capabilities of the cameras so that the images can be further processed to not only detect the presence of the object(s) 104, but also to detect and track object(s) movements. It is appreciated that in more robust implementations, object characteristics such as object features and object orientation can also be detected, tracked, and processed. The object(s) 104 of the present invention include basic hand movements created by one or more hands of a system user and/or other person selected for use with the disclosed system. However, in more robust system implementations, such objects can include many different types of objects with object characteristics, including hand gestures, each of which has gesture characteristics including, but not limited to, hand movement, finger count, finger orientation, hand rotation, hand orientation, and hand pose (e.g., opened, closed, and partially closed).

The tracking component 102 interfaces to a control component 106 of the system 100 that controls all onboard component processes. The control component 106 interfaces to a seeding component 108 that seeds object hypotheses to the tracking component based upon the object characteristics.

The object(s) 104 are detected and tracked in the scene such that object characteristic data is processed according to predetermined criteria to associate the object characteristic data with commands for interacting with a user interface component 110. The user interface component 110 interfaces to the control component 106 to receive control instructions that affect presentation of text, graphics, and other output (e.g., audio) provided to the user via the interface component 110. The control instructions are communicated to the user interface component 110 in response to the object characteristic data processed from detection and tracking of the object(s) within a predefined engagement volume space 112 of the scene.

A filtering component 114 interfaces to the control component 106 to receive filtering criteria in accordance with user filter configuration data, and to process the filtering criteria such that tracked object(s) of respective object hypotheses are selectively removed from the object hypotheses and/or at least one hypothesis from a set of hypotheses within the volume space 112 and the scene. Objects are detected and tracked either within the volume space 112 or outside the volume space 112. Those objects outside of the volume space 112 are detected, tracked, and ignored, until entering the volume space 112.

The system 100 also receives user input via input port(s) 116, such as input from pointing devices, keyboards, interactive input mechanisms such as touch screens, and audio input devices.

The subject invention (e.g., in connection with object detection, tracking, and filtering) can employ various artificial intelligence-based schemes for carrying out various aspects of the subject invention. For example, a process for determining which object is to be selected for tracking can be facilitated via an automatic classification system and process. Such classification can employ a probabilistic and/or statistical-based analysis (e.g., factoring into the analysis utilities and costs) to prognose or infer an action that a user desires to be automatically performed. For example, a support vector machine (SVM) classifier can be employed. Other classification approaches that can be employed include Bayesian networks, decision trees, and probabilistic classification models providing different patterns of independence. Classification as used herein is also inclusive of statistical regression that is utilized to develop models of priority.

As will be readily appreciated from the subject specification, the subject invention can employ classifiers that are explicitly trained (e.g., via generic training data) as well as implicitly trained (e.g., via observing user behavior, receiving extrinsic information) so that the classifier(s) is used to automatically determine, according to predetermined criteria, which object(s) should be selected for tracking and which objects that were being tracked are now removed from tracking. The criteria can include, but are not limited to, object characteristics such as object size, object speed, direction of movement, distance from one or both cameras, object orientation, object features, and object rotation. For example, with respect to SVMs, which are well understood—it is to be appreciated that other classifier models can also be utilized, such as Naive Bayes, Bayes Net, decision trees, and other learning models—SVMs are configured via a learning or training phase within a classifier constructor and feature selection module. A classifier is a function that maps an input attribute vector, x=(x1, x2, x3, x4, . . . , xn), to a confidence that the input belongs to a class—that is, f(x)=confidence(class). In the case of object identification and tracking, for example, attributes include various sizes of the object, various orientations of the object, and object colors, and the classes are categories or areas of interest (e.g., object type and object pose).
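
The following is a minimal sketch of such a classifier, assuming a scikit-learn style API; the attribute names, training data, and threshold choices are hypothetical and serve only to illustrate mapping an attribute vector to a per-class confidence, not the patented implementation.

```python
# Hypothetical sketch: an SVM deciding whether a tracked object is a command
# hand, using illustrative attribute vectors (size, speed, depth, orientation).
import numpy as np
from sklearn.svm import SVC

# Each row is an attribute vector x = (size, speed, depth_in_inches, orientation_deg)
X_train = np.array([
    [0.30, 0.8, 15.0, 10.0],   # hand inside the engagement volume
    [0.28, 0.5, 18.0, 35.0],   # hand inside the engagement volume
    [0.90, 0.1, 40.0,  0.0],   # torso, too far away
    [0.05, 2.0, 12.0, 80.0],   # small, fast distractor
])
y_train = np.array([1, 1, 0, 0])  # 1 = track as command object, 0 = ignore

clf = SVC(probability=True).fit(X_train, y_train)

# f(x) = confidence(class): probability that a new observation is a command hand
x_new = np.array([[0.31, 0.7, 16.0, 20.0]])
print(clf.predict_proba(x_new)[0, 1])
```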

Referring now to FIG. 2, there is illustrated a schematic block diagram of a perceptual user interface system, in accordance with an aspect of the present invention. The system comprises a computer 200 with a traditional keyboard 202, input pointing device (e.g., a mouse) 204, microphone 206, and display 208. The system further comprises at least one video camera 210, at least one user 212, and software 214. The exemplary system of FIG. 2 is comprised of two video cameras 210 mounted substantially parallel to each other (that is, the rasters are parallel) and the user 212. The first camera is used to detect and track the object, and the second camera is used for determining at least the depth (or distance) of the object from the camera(s). The computer 200 is operably connected to the keyboard 202, mouse 204, and display 208. Video cameras 210 and microphone 206 are also operably connected to computer 200. The video cameras 210 “look” towards the user 212 and may point downward to capture objects within the volume defined above the keyboard and in front of the user. User 212 is typically an individual that is capable of providing hand gestures, holding objects in a hand, verbal commands, and mouse and/or keyboard input. The hand gestures and/or object(s) appear in video images created by the video cameras 210 and are interpreted by the software 214 as commands to be executed by computer 200. Similarly, microphone 206 receives verbal commands provided by user 212, which are, in turn, interpreted by software 214 and executed by computer 200. User 212 can control and operate various application programs on the computer 200 by providing a series of hand gestures or a combination of hand gestures, verbal commands, and mouse/keyboard input. The system can track any object presented in the scene in front of it. The depth information is used to “segment” the interacting object from the rest of the scene. The capability to exploit any sort of moving object in the scene is important at least with respect to accessibility scenarios.

In view of the foregoing structural and functional features described above, methodologies in accordance with various aspects of the present invention will be better appreciated with reference to FIGS. 3-5. While, for purposes of simplicity of explanation, the methodologies of FIGS. 3-5 are shown and described as executing serially, it is to be understood and appreciated that the present invention is not limited by the illustrated order, as some aspects could, in accordance with the present invention, occur in different orders and/or concurrently with other aspects from that shown and described herein. Moreover, not all illustrated features may be required to implement a methodology in accordance with an aspect of the present invention.

Accordingly, FIG. 3 is a flow diagram that illustrates a high-level methodology for detecting the user's hand, tracking movement of the hand, and interpreting commands in accordance with an aspect of the invention. While, for purposes of simplicity of explanation, the methodologies shown here and below are described as a series of acts, it is to be understood and appreciated that the present invention is not limited by the order of acts, as some acts may, in accordance with the present invention, occur in different orders and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the present invention.

The methodology begins at 300, where video images are scanned to determine whether any moving objects exist within the field of view (or scene) of the cameras. The system is capable of running one or more object hypothesis models to detect and track objects, whether moving or not moving. In one embodiment, the system runs up to and including six object hypotheses. If more than one object is detected as a result of the multiple hypotheses, the system drops one of the objects if the distance from any other object falls below a threshold distance, for example, five inches. It is assumed that the two hypotheses are redundantly tracking the same object, and one of the hypotheses is removed from consideration. At 302, if NO, no moving object(s) have been detected, and flow returns to 300 where the system continues to scan the current image for moving objects. Alternatively, if YES, object movement has been detected, and flow continues from 302 to 304 where it is determined whether or not one or more moving objects are within the engagement volume. It is appreciated that the depth of the object may be determined before determination of whether the object is within the engagement volume.
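
As a minimal sketch of the redundant-hypothesis pruning just described, the five-inch merge threshold comes from the text, while the data structure holding each hypothesis is an assumption made for illustration.

```python
# Hypothetical sketch of pruning redundant object hypotheses: if two tracked
# hypotheses lie closer together than a threshold (e.g., five inches), they are
# assumed to be tracking the same object and one of them is dropped.
import numpy as np

MERGE_THRESHOLD_INCHES = 5.0

def prune_redundant(hypotheses):
    """hypotheses: list of dicts, each with a 3-D 'position' expressed in inches."""
    kept = []
    for h in hypotheses:
        too_close = any(
            np.linalg.norm(np.asarray(h["position"]) - np.asarray(k["position"]))
            < MERGE_THRESHOLD_INCHES
            for k in kept
        )
        if not too_close:
            kept.append(h)
    return kept
```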

The engagement volume is defined as a volume of space in front of the video cameras and above the keyboard wherein the user is required to introduce the hand gestures (or object(s)) in order to utilize the system. A purpose of the engagement volume is to provide a means for ignoring all objects and/or gestures in motion except for those intended by the user to effect control of the computer. If a moving object is detected at 302, but is determined not to be within the engagement volume, then the system dismisses the moving object as not being a desired object to track for providing commands. Flow then loops back to the input of 300 to scan for more objects. However, if the moving object is determined to be within the engagement volume, then the methodology proceeds to 306. However, new objects are seeded only when it is determined that the new object is a sufficient distance away from an existing object that is being tracked (in 3-D). At 306, the system determines the distance of each moving object from the video cameras. At 308, if more than one moving object is detected within the engagement volume, then the object closest to the video camera(s) is selected as the desired command object. If, by the given application context, the user is predisposed to use hand gestures towards the display, the nearest object hypotheses will apply to the hands. In other scenarios, more elaborate criteria for object selection may be used. For example, an application may select a particular object based upon its quality of movement over time. Additionally, a two-handed interaction application may select an object to the left of the dominant hand (for right-handed users) as the non-dominant hand. The command object is the object that has been selected for tracking, the movements of which will be analyzed and interpreted for gesture commands. The command object is generally the user's dominant hand. Once the command object is selected, its movement is tracked, as indicated at 310.

At 312, the system determines whether the command object is still within the engagement volume. If NO, the object has moved outside the engagement volume, and the system dismisses the object hypothesis and returns to 300, where the current image is processed for moving objects. If YES, the object is still within the engagement volume, and flow proceeds to 314. At 314, the system determines whether the object is still moving. If no movement is detected, flow is along the NO path returning to 300 to process the current camera images for moving objects. If, however, movement is detected, then flow proceeds from 314 to 316. At 316, the system analyzes the movements of the command object to interpret the gestures for specific commands. At 318, it is determined whether the interpreted gesture is a recognized command. If NO, the movement is not interpreted as a recognized command, and flow returns to 310 to continue tracking the object. However, if the object movement is interpreted as a recognized command, flow is to 320, where the system executes the corresponding command. After execution thereof, flow returns to 310 to continue tracking the object. This process may continually execute to detect and interpret gestures.

In accordance with an aspect of the invention, algorithms used to interpret gestures are kept to simple algorithms and are performed on sparse (“lightweight”) images to limit the computational overhead required to properly interpret and execute desired commands in real time. In accordance with another aspect of the invention, the system is able to exploit the presence of motion and depth to minimize computational requirements involved in determining objects that provide gesture commands.

Referring now to FIG. 4, there is illustrated a flow diagram of a methodology for determining the presence of moving objects within video images created by one or more video sources, in accordance with an aspect of the present invention. The methodology exploits the notion that attention is often drawn to objects that move. At 400, video data is acquired from one or more video sources. Successive video images are selected from the same video source, and motion is detected by comparing a patch of a current video image, centered on a given location, to a patch from the previous video image centered on the same location. At 402, video patches centered about points located at (u₁, v₁) and (u₂, v₂) are selected from successive video images I₁ and I₂, respectively. A simple comparison function is utilized wherein the sum of the absolute differences (SAD) over square patches in two images is obtained. For a patch from image I₁ centered on pixel location (u₁, v₁) and a patch in image I₂ centered on (u₂, v₂), the image comparison function SAD(I₁, u₁, v₁, I₂, u₂, v₂) is defined as:

$\mathrm{SAD}(I_{1},u_{1},v_{1},I_{2},u_{2},v_{2}) = \sum\limits_{-\frac{D}{2}\,\leq\, i,\,j\,\leq\,\frac{D}{2}} \left| I_{1}(u_{1}+i,\; v_{1}+j) - I_{2}(u_{2}+i,\; v_{2}+j) \right|$

where I(u, v) refers to the pixel at (u, v), D is the patch width, and the absolute difference between two pixels is the sum of the absolute differences taken over all available color channels. Regions in the image that have movement can be found by determining points (u, v) such that SAD(I_(t-1), u_(t-1), v_(t-1), I_(t), u_(t), v_(t)) > τ, where the subscript refers to the image at time t, and τ is a threshold level for motion. At 404, a comparison is made between patches from images I₁ and I₂ using the sum of the absolute differences algorithm. At 406, the result of the sum of the absolute differences algorithm is compared to a threshold value to determine whether a threshold level of motion exists within the image patch. If SAD ≤ τ, no sufficient motion exists, and flow proceeds to 410. If, at 406, SAD > τ, then sufficient motion exists within the patch, and flow is to 408 where the object is designated for continued tracking. At 410, the system determines whether the current image patch is the last patch to be examined within the current image. If NO, the methodology returns to 402 where a new patch is selected. If YES, then the system returns to 400 to acquire a new video image from the video source.
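
A minimal sketch of this motion test follows, assuming single-channel (grayscale) numpy images; the patch width and threshold values are illustrative only.

```python
# Hypothetical sketch of the SAD-based motion test: compare a DxD patch of the
# current frame against the same location in the previous frame and flag motion
# when the sum of absolute differences exceeds a threshold.
import numpy as np

def sad(img1, u1, v1, img2, u2, v2, D=16):
    """Sum of absolute differences over DxD patches centered on (u, v)."""
    h = D // 2
    p1 = img1[v1 - h:v1 + h, u1 - h:u1 + h].astype(np.int32)
    p2 = img2[v2 - h:v2 + h, u2 - h:u2 + h].astype(np.int32)
    return int(np.abs(p1 - p2).sum())

def has_motion(prev_frame, curr_frame, u, v, threshold=800, D=16):
    """True when the patch at (u, v) changed enough between successive frames."""
    return sad(prev_frame, u, v, curr_frame, u, v, D) > threshold
```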

To reduce the computational load, the SAD algorithm is computed on a sparse regular grid within the image. In one embodiment, the sparse regular grid is based on sixteen-pixel centers. When the motion detection methodology determines that an object has sufficient motion, then the system tracks the motion of the object. Again, in order to limit (or reduce) the computational load, a position prediction algorithm is used to predict the next position of the moving object. In one embodiment, the prediction algorithm is a Kalman filter. However, it is to be appreciated that any position prediction algorithm can be used.
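
As an illustration of such a predictor, the following is a minimal constant-velocity Kalman-style prediction step in numpy; the state layout and noise values are assumptions for illustration, not the patented implementation.

```python
# Hypothetical constant-velocity Kalman prediction step for a tracked point.
# State x = [u, v, du, dv]; only the predict step is shown, since it is what
# centers the next local search window.
import numpy as np

F = np.array([[1, 0, 1, 0],    # u' = u + du
              [0, 1, 0, 1],    # v' = v + dv
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = np.eye(4) * 0.5            # assumed process noise

def predict(x, P):
    """Return the predicted state and covariance for the next frame."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

x = np.array([120.0, 80.0, 3.0, -1.0])   # position (px) and velocity (px/frame)
P = np.eye(4)
x_pred, P_pred = predict(x, P)
print(x_pred[:2])   # predicted (u, v) used to center the search window
```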

Note that the image operations may use the same SAD function on image patches, which allows for easy SIMD (Single-Instruction, Multiple-Data) optimization of the algorithm's implementation, which in turn allows it to run with sufficiently many trackers while still leaving CPU time to the user.

The process of seeding object hypotheses based upon motion may place more than one hypothesis on a given moving object. One advantage of this multiple hypothesis approach is that a simple, fast, and imperfect tracking algorithm may be used. Thus, if one tracker fails, another may be following the object of interest. Once a given tracker has been seeded, the algorithm updates the position of the object being followed using the same function over successive frames.

Referring now to FIG. 5, there is illustrated a flow diagram of a methodology for tracking a moving object within an image, in accordance with an aspect of the present invention. The methodology begins at 500 where, after the motion detection methodology has identified the location of a moving object to be tracked, the next position of the object is predicted. Once identified, the methodology utilizes a prediction algorithm to predict the position of the object in successive frames. The prediction algorithm limits the computational burden on the system. In the successive frames, the moving object should be at the predicted location, or within a narrow range centered on the predicted location. At 502, the methodology selects a small pixel window (e.g., ten pixels) centered on the predicted location. Within this small window, an algorithm executes to determine the actual location of the moving object. At 504, the new position is determined by examining the sum of the absolute differences over successive video frames acquired at time t and time t−1. The actual location is determined by finding the location (u_(t), v_(t)) that minimizes:

SAD(I_(t-1), u_(t-1), v_(t-1), I_(t), u_(t), v_(t)),

where I_(t) refers to the image at time t, I_(t-1) refers to the image at time t−1, and (u_(t), v_(t)) refers to the location at time t. Once determined, the actual position is updated, at 506. At 508, motion characteristics are evaluated to determine whether the motion is still greater than the threshold level required. What is evaluated is not only the SAD image-based computation, but also movement of the object over time. The movement parameter is the average movement over a window of time. Thus, if the user pauses the object or hand for a short duration of time, it may not be dropped from consideration. However, if the duration of the pause is longer, such that it exceeds a predetermined average time parameter, the object will be dropped. If YES, the motion is sufficient, and flow returns to 500 where a new prediction for the next position is determined. If NO, the object motion is insufficient, and the given object is dropped from being tracked, as indicated by flow to 510. At 512, flow is to 402 of FIG. 4 to select a new patch in the image from which to analyze motion.
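
A minimal sketch of this local search follows, assuming grayscale numpy frames; the window and patch sizes are illustrative (the text mentions a search window on the order of ten pixels).

```python
# Hypothetical sketch of the tracking update: search a small window around the
# predicted location for the position that minimizes the SAD against the
# previous frame's patch.
import numpy as np

def sad(img1, u1, v1, img2, u2, v2, D=16):
    h = D // 2
    p1 = img1[v1 - h:v1 + h, u1 - h:u1 + h].astype(np.int32)
    p2 = img2[v2 - h:v2 + h, u2 - h:u2 + h].astype(np.int32)
    return int(np.abs(p1 - p2).sum())

def refine_position(prev_frame, curr_frame, prev_uv, predicted_uv,
                    window=10, D=16):
    """Return the (u, v) near the prediction that best matches the old patch."""
    pu, pv = prev_uv
    cu, cv = predicted_uv
    best_uv, best_cost = predicted_uv, None
    for du in range(-window // 2, window // 2 + 1):
        for dv in range(-window // 2, window // 2 + 1):
            cost = sad(prev_frame, pu, pv, curr_frame, cu + du, cv + dv, D)
            if best_cost is None or cost < best_cost:
                best_cost, best_uv = cost, (cu + du, cv + dv)
    return best_uv
```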

When determining the depth information of an object (i.e., the distance from the object to the display or any other chosen reference point), a lightweight sparse stereo approach is utilized in accordance with an aspect of the invention. The sparse stereo approach is a region-based approach utilized to find the disparity only at locations in the image corresponding to the object hypothesis. Note that in the stereo matching process, it is assumed that both cameras are parallel (in rasters). Object hypotheses are supported by frame-to-frame tracking through time in one view and stereo matching across both views. A second calibration issue is the distance between the two cameras (i.e., the baseline), which must be considered to recover depth in real world coordinates. In practice, both calibration issues may be dealt with automatically by fixing the cameras on a prefabricated mounting bracket, or semi-automatically by the user presenting objects at a known depth in a calibration routine that requires a short period of time to complete. The accuracy of the transform to world coordinates is improved by accounting for lens distortion effects with a static, pre-computed calibration procedure for a given camera.

Binocular disparity is the primary means for recovering depth information from two or more images taken from different viewpoints. Given the two-dimensional position of an object in two views, it is possible to compute the depth of the object. Given that the two cameras are mounted parallel to each other in the same horizontal plane, and given that the two cameras have a focal length f, the three-dimensional position (x, y, z) of an object is computed from the positions of the object in both images, (u_(l), v_(l)) and (u_(r), v_(r)), by the following perspective projection equations:

$u_{l} = f\frac{x}{z};\qquad v_{l} = v_{r} = f\frac{y}{z};\qquad d = u_{l} - u_{r} = f\frac{b}{z};$

where the disparity, d, is the shift in location of the object in one view with respect to the other, and is related to the baseline b, the distance between the two cameras.
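
These relations invert directly; the following sketch recovers depth and 3-D position from a measured disparity. The focal length and baseline values in the example are illustrative only.

```python
# Hypothetical sketch of recovering depth from binocular disparity using the
# perspective projection relations above: d = f*b/z, so z = f*b/d, and then
# x = z*u_l/f and y = z*v_l/f.
def position_from_disparity(u_l, v_l, u_r, f, b):
    d = u_l - u_r                 # disparity in pixels
    if d <= 0:
        raise ValueError("disparity must be positive for an object in front of the cameras")
    z = f * b / d                 # depth in the same units as the baseline b
    x = z * u_l / f
    y = z * v_l / f
    return x, y, z

# Illustrative values: focal length 600 px, baseline 12 in, 30 px disparity.
print(position_from_disparity(320, 240, 290, f=600.0, b=12.0))
```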

The vision algorithm performs 3-dimensional (3-D) tracking and 3-D depth computations. In this process, each object hypothesis is supported only by consistency of the object movement in 3-D. Unlike many conventional computer vision algorithms, the present invention does not rely on fragile appearance models such as skin color models or hand image templates, which are likely invalidated when environmental conditions change or the system is confronted with a different user.

Referring now to FIG. 6, there is illustrated a disparity between two video images captured by two video cameras mounted substantially parallel to each other for the purpose of determining the depth of objects, in accordance with an aspect of the present invention. In FIG. 6, a first camera 600 and a second camera 602 (similar to cameras 210) are mounted substantially parallel to each other in the same horizontal plane and laterally aligned. The two cameras (600 and 602) are separated by a distance 604 defined between the longitudinal focal axis of each camera lens, also known as the baseline, b. A first video image 606 is the video image from the first camera 600, and a second video image 608 is the video image from the second camera 602. The disparity d (also item number 610), or shift in the two video images (606 and 608), can be seen by looking to an object 612 in the center of the first image 606, and comparing the location of that object 612 in the first image 606 to the location of that same object 612 in the second image 608. The disparity 610 is illustrated as the difference between a first vertical centerline 614 of the first image 606 that intersects the center of the object 612, and a second vertical centerline 616 of the second image 608. In the first image 606, the object 612 is centered about the vertical centerline 614 with the top of the object 612 located at point (u, v). In the second image 608, the same point (u, v) of the object 612 is located at point (u−d, v), where d is the disparity 610, or shift in the object from the first image 606 with respect to the second image 608. Given disparity d, a depth z can be determined. As will be discussed, in accordance with one aspect of the invention, the depth component z is used in part to determine if an object is within the engagement volume, where the engagement volume is the volume within which objects will be selected by the system.

In accordance with another aspect of the present invention, a sparse stereo approach is utilized in order to limit computational requirements. The sparse stereo approach is that which determines disparity d only at the locations in the image that correspond to a moving object. For a given point (u, v) in the image, the value of disparity d is found such that the sum of the absolute differences over a patch in the first image 606 (i.e., a left image I_(L)) centered on (u, v) and a corresponding patch in the second image 608 (i.e., a right image I_(R)) centered on (u−d, v) is minimized, i.e., the disparity value d that minimizes SAD(I_(L), u, v, I_(R), u−d, v). If an estimate of depth z is available from a previous time, then in order to limit computational requirements, the search for the minimal disparity d is limited to a range consistent with the last known depth z.
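
A minimal sketch of this per-point disparity search follows, assuming rectified grayscale numpy images; the disparity range and patch width are illustrative, and in practice the range would be narrowed around the last known depth as described above.

```python
# Hypothetical sketch of the sparse stereo step: for a single tracked point
# (u, v) in the left image, find the disparity d that minimizes the SAD against
# the right-image patch centered on (u - d, v).
import numpy as np

def sad(img1, u1, v1, img2, u2, v2, D=16):
    h = D // 2
    p1 = img1[v1 - h:v1 + h, u1 - h:u1 + h].astype(np.int32)
    p2 = img2[v2 - h:v2 + h, u2 - h:u2 + h].astype(np.int32)
    return int(np.abs(p1 - p2).sum())

def best_disparity(left, right, u, v, d_min=1, d_max=64, D=16):
    """Search a disparity range at one image location only (sparse stereo)."""
    costs = {d: sad(left, u, v, right, u - d, v, D)
             for d in range(d_min, d_max + 1)}
    return min(costs, key=costs.get)
```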

In accordance with another aspect of the invention, the search range may be further narrowed by use of an algorithm to predict the object's new location. In one embodiment, the prediction is accomplished by utilization of a Kalman filter.

The depth z can also be computed using traditional triangulation techniques. The sparse stereo technique is used when the system operation involves detecting moving objects within a narrow range in front of the display, e.g., within twenty inches. In such cases, the two video cameras are mounted in parallel and can be separated by a distance equal to the approximate width of the display, or an even smaller distance of a few inches. However, when the system is implemented in a larger configuration, the distance between the two video cameras may be much greater. In such cases, traditional triangulation algorithms are used to determine the depth.

The foregoing discussion has focused on some details of the methodologies associated with locating and tracking an object to effect execution of corresponding and specified commands. An overview follows as to how these capabilities are implemented in one exemplary system.

Referring now to FIG. 7, there is illustrated an example of gestures that the system recognizes, along with the visual feedback provided through the display. A user 700 gives commands by virtue of different hand gestures 702 and/or verbal commands 704. The gestures 702 are transmitted to a system computer (not shown) as part of the video images created by a pair of video cameras (706 and 708). Verbal and, more generally, audio commands are input to the system computer through a microphone 710. Typical GUI windows 712, 714, and 716 are displayed in a layered presentation in an upper portion of display 718, while a lower portion of display 718 provides visual graphic feedback in the form of icons 720, 722, 724, and 726 of some of the gestures 702 recognized by the system.

In one example, the hand icon 720 is displayed when a corresponding gesture 728 is recognized. The name of the recognized command (Move) is also then displayed below the icon 720 to provide additional textual feedback to the user 700. Move and Raise commands may be recognized by dwelling on the window for a period of time. There is also a “flick” or “bump” command to send a window from one monitor to another monitor, in a multiple monitor configuration. This is controlled by moving the hand (or object) to the left or right, and is described in greater detail hereinbelow with respect to FIG. 9B. There are at least two ways to effect a Move: by speech recognition when voicing the word “Move”, the phrase “Move Window”, or any other associated voice command(s); and by using the dwelling technique. It is appreciated that where more robust image capture and image processing systems are implemented, the pose of the hand may be mapped to any functionality, as described in greater detail below. Moreover, the shape of the hand icon may be changed in association with the captured hand pose to provide visual feedback to the user that the correct hand pose is being processed. However, as a basic implementation, the hand icon is positioned for selecting the window for interaction, to move the window, or to effect scrolling.

A Scroll command may be initiated first by voicing a corresponding command that is processed by speech recognition, and then using the hand (or object) to commence scrolling of the window by moving the hand (or object) up and down for the desired scroll direction.

In another example, the single displayed hand icon 720 is presented for all recognized hand gestures 702; however, the corresponding specific command name is displayed below the icon 720. Here, the same hand icon 720 is displayed in accordance with four different hand gestures utilized to indicate four different commands: Move, Close, Raise, and Scroll.

In still another aspect of the present invention, a different hand shaped icon is used for each specific command, and the name of the command is optionally displayed below the icon. In yet another embodiment, audio confirmation is provided by the computer, in addition to the displayed icon and optional command name displayed below the icon.

As previously mentioned, FIG. 7 illustrates the embodiment where a single hand shaped icon 720 is used, and the corresponding command recognized by the system is displayed below the icon 720. For example, when the system recognizes, either by virtue of gestures (with hand and/or object) and/or verbal commands, the command to move a window, the icon 720 and corresponding command word “MOVE” are displayed by the display 718. Similarly, when the system recognizes a command to close a window, the icon 720 and corresponding command word “CLOSE” may be displayed by the display 718. Additional examples include, but are not limited to, displaying the icon 720 and corresponding command word “RAISE” when the system recognizes a hand gesture to bring a GUI window forward. When the system recognizes a hand gesture corresponding to a scroll command for scrolling a GUI window, the icon 720 and command word “SCROLL” are displayed by the display 718.

It is to be appreciated that the disclosed system may be configured to display any number and type of graphical icons in response to one or more hand gestures presented by the system user. Additionally, audio feedback may be used such that a beep or tone may be presented in addition to or in lieu of the graphical feedback. Furthermore, the graphical icon may be used to provide feedback in the form of a color, combination of colors, and/or flashing color or colors. Feedback may also be provided by flashing a border of the selected window, or the border in the direction of movement. For example, if the window is to be moved to the right, the right window border could be flashed to indicate the selected direction of window movement. In addition to, or separate from, the visual feedback, a corresponding tone frequency or any other associated sound may be emitted to indicate direction of movement, e.g., an upward movement would have an associated high pitch and a downward movement would have a low pitch. Still further, rotational aspects may be provided such that movement to the left effects a counterclockwise rotation of a move icon, or perhaps a leftward tilt in the GUI window in the direction of movement.

Referring now to FIG. 8, there is illustrated an alternative embodiment wherein a unique icon is displayed in association with a name of a specific recognized command, in accordance with an aspect of the present invention. Here, each icon-word pair is unique for each recognized command. Icon-word pairs 800, 802, 804, and 806 for the respective commands “MOVE”, “CLOSE”, “RAISE”, and “SCROLL” are examples of visual feedback capabilities that can be provided.

The system is capable of interpreting commands based on interpreting hand gestures, verbal commands, or both in combination. A hand is identified as a moving object by the motion detection algorithms, and the hand movement is tracked and interpreted. In accordance with one aspect of the invention, hand gestures and verbal commands are used cooperatively. Speech recognition is performed using suitable voice recognition applications, for example, Microsoft SAPI 5.1, with a simple command and control grammar. However, it is understood that any similar speech recognition system can be used. An inexpensive microphone is placed near the display to receive audio input. However, the microphone can be placed at any location insofar as audio signals can be received thereinto and processed by the system.

Following is an example of functionality that is achieved by combining hand gesture and verbal modalities. Interaction with the system can be initiated by a user moving a hand across an engagement plane and into an engagement volume.

Referring now to FIG. 9A, there is illustrated the engagement plane and engagement volume for a single monitor system of the present invention. A user 900 is located generally in front of a display 902, which is also within the imaging capabilities of a pair of cameras (906 and 908). A microphone 904 (similar to microphones 206 and 710) is suitably located such that user voice signals are input for processing, e.g., in front of the display 902. The cameras (906 and 908, similar to cameras 210, and 706 and 708) are mounted substantially parallel to each other and on a horizontal plane above the display 902. The two video cameras (906 and 908) are separated by a distance that provides optimum detection and tracking for the given cameras and the engagement volume. However, it is to be appreciated that cameras suitable for wider fields of view and higher resolution may be placed further apart on a plane different from the top of the display 902, for example, lower and along the sides of the display facing upwards, to capture gesture images for processing in accordance with novel aspects of the present invention. In accordance therewith, more robust image processing capabilities and hypothesis engines can be employed in the system to process greater amounts of data.

Between the display 902 and the user 900 is a volume 910 defined as the engagement volume. The system detects and tracks objects inside and outside of the volume 910 to determine the depth of one or more objects with respect to the engagement volume 910. However, those objects determined to be of a depth that is outside of the volume 910 will be ignored. As mentioned hereinabove, the engagement volume 910 is typically defined to be located where the hands and/or objects in the hands of the user 900 are most typically situated, i.e., above a keyboard of the computer system and in front of the cameras (906 and 908), between the user 900 and the display 902 (provided the user 900 is seated in front of the display on which the cameras (906 and 908) are located). However, it is appreciated that the user 900 may be standing while controlling the computer, which requires that the volume 910 be located accordingly to facilitate interface interaction. Furthermore, the objects may include not only the hand(s) of the user, or objects in the hand(s), but other parts of the body, such as the head, torso, or arms, or any other detectable objects. This is described in greater detail hereinbelow.

A plane 912 defines the face of the volume 910 that is closest to the user 900, and is called the engagement plane. The user 900 may effect control of the system by moving a hand (or object) through the engagement plane 912 and into the engagement volume 910. As noted above, the hand of the user 900 is detected and tracked even when outside the engagement volume 910; however, it is ignored when outside of the engagement volume 910 insofar as control of the computer is concerned. When the object is moved across the engagement plane 912, feedback is provided to the user in the form of an alpha-blended icon displayed on the display (e.g., on an operating system desktop). The icon is designed to be perceived as distinct from other desktop icons and may be viewed as an area cursor. The engagement plane 912 is positioned such that the user's hands do not enter it during normal use of the keyboard and mouse. When the system engages the hand or object, the corresponding hand icon displayed on the desktop is moved to reflect the position of the tracked object (or hand).

The engagement and acquisition of the moving hand (or object) is implemented in the lightweight sparse stereo system by looking for the object with a depth that is less than a predetermined distance value. Any such object is considered the command object until it is moved out of the engagement volume 910, for example, behind the engagement plane 912, or until the hand (or object) is otherwise removed from being a tracked object. In one example, the specified distance is twenty inches.
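The following is a minimal sketch of this depth-based engagement test. The class and field names (EngagementVolume, TrackedObject) and the use of the twenty-inch figure as a default threshold are illustrative assumptions, not the system's actual implementation.

    # Sketch: acquire/release a command object using only its stereo depth.
    from dataclasses import dataclass
    from typing import Optional

    ENGAGEMENT_DEPTH_INCHES = 20.0  # example threshold mentioned in the text

    @dataclass
    class TrackedObject:
        object_id: int
        depth_inches: float   # distance from the camera plane, via sparse stereo

    class EngagementVolume:
        def __init__(self, depth_threshold: float = ENGAGEMENT_DEPTH_INCHES):
            self.depth_threshold = depth_threshold
            self.command_object: Optional[TrackedObject] = None

        def update(self, tracked: TrackedObject) -> Optional[TrackedObject]:
            """Acquire or release the command object based on depth alone."""
            inside = tracked.depth_inches < self.depth_threshold
            if self.command_object is None and inside:
                self.command_object = tracked      # engage: hand crossed the plane
            elif self.command_object is not None and not inside:
                self.command_object = None         # release: hand moved back out
            return self.command_object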

In operation, the user 900 moves a hand through the engagement plane 912 and into the engagement volume 910 established for the system. The system detects the hand, tracks the hand as it moves from outside of the volume 910 to the inside, and provides feedback by displaying a corresponding hand-shaped icon on the display 902. The open microphone 904 placed near the display 902 provides means for the user 900 to invoke one or more verbal commands in order to act upon the selected window under the icon. The window directly underneath the hand-shaped icon is the selected window. When a spoken and/or audio command is input to and understood by the system, the interpreted command is displayed along with the hand-shaped icon. For example, in one embodiment, by speaking the word "Move", the user may initiate the continuous (or stepped) movement of the window under the hand-shaped icon to follow the movement of the user's hand. The user 900 causes the selected window to move up or down within the display 902 by moving the hand up or down. Lateral motion is achieved similarly. Movement of the window is terminated when the user's hand is moved across the engagement plane 912 and out of the engagement volume 910. Other methods of termination include stopping movement of the hand (or object) for an extended period of time, which is processed by the system as a command to drop the associated hypothesis. Furthermore, as described hereinabove, the Move command may be invoked by dwelling the hand on the window for a period of time, followed by hand motion to initiate the direction of window movement.

Alternatively, the user may speak the word "Release" and the system will stop moving the selected window in response to the user's hand motion. Release may also be accomplished by dwelling a bit longer in time while in the Move and/or Scroll modes. The user 900 may also act upon a selected window with other actions. By speaking the words "Close", "Minimize", or "Maximize", the selected window is respectively closed, minimized, or maximized. By speaking the word "Raise", the selected window is brought to the foreground, and by speaking "Send to Back", the selected window is sent behind (to the background of) all other open windows. By speaking "Scroll", the user initiates a scrolling mode on the selected window. The user may control the rate of the scroll by the position of the hand. The hand-shaped icon tracks the user's hand position, and the rate of the scrolling of the selected window is proportional to the distance between the current hand icon position and the position of the hand icon at the time the scrolling was initiated. Scrolling can be terminated by the user speaking "Release" or by the user moving their hand behind the engagement plane and out of the engagement volume. These are just a few examples of the voice recognition perceptual computer control capabilities of the disclosed architecture. It is to be appreciated that these voiced commands may also be programmed for execution in response to one or more object movements in accordance with the present invention.
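The scroll-rate rule can be sketched as simple proportional control: the scroll speed grows with the distance between the current hand-icon position and the position at which "Scroll" was invoked. The gain constant and the clamp below are assumptions for the example only.

    # Sketch: scroll rate proportional to displacement from the scroll anchor.
    def scroll_rate(anchor_y: float, current_y: float,
                    gain: float = 4.0, max_rate: float = 200.0) -> float:
        """Return a signed scroll rate in pixels/second.

        anchor_y  -- vertical icon position when scrolling was initiated
        current_y -- current vertical icon position
        """
        displacement = current_y - anchor_y          # distance from the anchor
        rate = gain * displacement                   # proportional control
        return max(-max_rate, min(max_rate, rate))   # clamp to a sane maximum

    # Usage: each frame, advance the window's scroll offset.
    # offset += scroll_rate(anchor_y, icon_y) * frame_dt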

In accordance with another aspect of the invention, dwell time can be used as a modality to control windows in lieu of, or in addition to, verbal commands and other disclosed modalities. Dwell time is defined as the time, after having engaged the system, that the user holds their hand position stationary such that the system hand-shaped icon remains over a particular window. For example, by dwelling on a selected window for a short period of time (e.g., two seconds), the system can bring the window to the foreground of all other open windows (i.e., a RAISE command). Similarly, by dwelling a short time longer (e.g., four seconds), the system will grab (or select for dragging) the window, and the user causes the selected window to move up or down within the display by moving a hand up or down (i.e., a MOVE command). Lateral motion is achieved similarly. Additional control over GUI windows can be accomplished in a similar fashion by controlling the dwell time of the hand-shaped icon over the open window.
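A minimal sketch of this dwell-time modality follows: holding the hand icon stationary over a window for successively longer intervals triggers successively stronger commands. The thresholds mirror the examples above (two seconds for RAISE, four for MOVE); the class and method names are hypothetical.

    # Sketch: map dwell duration over a window to escalating commands.
    import time

    class DwellCommander:
        # (threshold_seconds, command) pairs, checked from longest to shortest
        THRESHOLDS = [(4.0, "MOVE"), (2.0, "RAISE")]

        def __init__(self):
            self.dwell_start = None
            self.issued = set()

        def update(self, icon_is_stationary: bool):
            """Call once per frame; returns a command name when a threshold fires."""
            if not icon_is_stationary:
                self.dwell_start, self.issued = None, set()
                return None
            if self.dwell_start is None:
                self.dwell_start = time.monotonic()
            elapsed = time.monotonic() - self.dwell_start
            for threshold, command in self.THRESHOLDS:
                if elapsed >= threshold and command not in self.issued:
                    self.issued.add(command)
                    return command      # e.g. "RAISE" at 2 s, then "MOVE" at 4 s
            return None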

In accordance with a more robust aspect of the invention, hand gestures are interpreted by hand motion or by pattern recognition. For example, the user can bring the window to the front (or foreground), on top of all other open windows, by moving a hand from a position closer to the display to a position farther from the display, the hand remaining in the engagement volume 910. The use of 3-D imaging is described in greater detail hereinbelow. Similarly, the user can cause the selected window to be grabbed and moved by bringing their fingers together with their thumb, and subsequently moving the hand. The selected window will move in relation to the user's hand movement until the hand is opened up to release the selected window. Additional control over the selected window can be defined in response to particular hand movements or hand gestures. In accordance with another aspect of the present invention, the selected window will move in response to the user pointing their hand, thumb, or finger in a particular direction. For example, if the user points their index finger to the right, the window will move to the right within the display. Similarly, if the user points to the left, up, or down, the selected window will move to the left, up, or down within the display, respectively. Additional window controls can be achieved through the use of similar hand gestures or motions.

In accordance with another aspect of the invention, the system is configurable such that an individual user selects the particular hand gestures that they wish to associate with particular commands. The system provides default settings that map a given set of gestures to a given set of commands. This mapping, however, is configurable such that the specific command executed in response to each particular hand gesture is definable by each user. For example, one user may wish to point directly at the screen with their index finger to grab the selected window for movement, while another user may wish to bring their fingers together with their thumb to grab the selected window. Similarly, one user may wish to point a group of fingers up or down in order to move a selected window up or down, while another user may wish to open the palm of their hand toward the cameras and then move their opened hand up or down to move a selected window up or down. All gestures and commands are configurable by the individual users to best suit that particular user's personal preferences.
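One way to realize such a per-user mapping is a default gesture-to-command table that each user's profile may override. The gesture labels and command names below are illustrative placeholders, not the system's actual vocabulary.

    # Sketch: per-user gesture-to-command mapping with overridable defaults.
    DEFAULT_MAP = {
        "index_point_at_screen": "GRAB",
        "pinch_fingers_to_thumb": "GRAB",
        "fingers_point_up":       "MOVE_UP",
        "fingers_point_down":     "MOVE_DOWN",
        "open_palm_raise":        "MOVE_UP",
    }

    class GestureProfile:
        def __init__(self, user_id: str, overrides: dict | None = None):
            self.user_id = user_id
            # user overrides shadow the defaults without modifying them
            self.mapping = {**DEFAULT_MAP, **(overrides or {})}

        def command_for(self, gesture_label: str):
            return self.mapping.get(gesture_label)

    # Example: one user remaps the pinch gesture to CLOSE instead of GRAB.
    bob = GestureProfile("bob", {"pinch_fingers_to_thumb": "CLOSE"})
    assert bob.command_for("pinch_fingers_to_thumb") == "CLOSE"
    assert bob.command_for("fingers_point_up") == "MOVE_UP"   # default retained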

Similarly, in accordance with another aspect of the present invention, the system may include a "Record and Define Gesture" mode. In the "Record and Define Gesture" mode, the system records hand gestures performed by the user. The recorded gestures are then stored in the system memory to be recognized during normal operation. The given hand gestures are then associated with a particular command to be performed by the system in response to that particular hand gesture. With such capability, a user may further tailor the system to their personal preference or, similarly, may tailor system operation to respond to specific commands most appropriate for particular applications.

In a similar fashion, the user can choose the particular words, from a given set, that they wish to use for a particular command. For example, one user may choose to say "Release" to stop moving a window while another may wish to say "Quit". This capability gives different users, who may prefer to use different words for a given command, the ability to tailor the system in a way most efficient for their personal use.

The present invention can be utilized in an expansive list of applications. The following discussion is exemplary of only a few applications with which the present invention may be utilized. One such application is associated with user control of a presentation, or similar type of briefing application, wherein the user makes a presentation on a projection-type screen to a group of listeners.

Referring now to FIG. 9B, there is illustrated a multiple-monitor implementation. Here, the system includes three monitors (or displays) through which the user 900 exercises control of GUI features: a first display 912, a second display 914, and a third display 916. The cameras (906 and 908) are situated as in FIG. 9A, to define the engagement volume 910. By utilizing the "flick" or "bump" motion(s) performed by a hand 918 of the user 900, the user 900 can move a window 920 from the first display 912 to the second display 914, and further from the second display 914 to the third display 916. The flick motion of the user hand 918 can effect movement of the window 920 from the first display 912 to the third display 916 in a single window movement, or in multiple steps through the displays (914 and 916) using corresponding multiple hand motions. Of course, control by the user 900 occurs only when the user hand 918 breaks the engagement plane 912 and is determined to be a control object (i.e., an object meeting parameters sufficient to effect control of the computer).

As mentioned hereinabove, the user 900 is located generally in front of the displays (912, 914, and 916), which are also within the imaging capabilities of the pair of cameras (906 and 908). The microphone 904 is suitably located to receive user voice signals. The cameras (906 and 908) are mounted substantially parallel to each other and on a horizontal plane above the displays (912, 914, and 916), and separated by a distance that provides optimum detection and tracking for the given cameras and the engagement volume 910.

In operation, the user 900 moves the hand 918 through the engagement plane 912 and into the engagement volume 910 established for the system. The system, which had detected and tracked the hand 918 before it entered the volume 910, begins providing feedback to the user 900 by displaying the hand-shaped icon 922 on one of the displays (912, 914, and 916). The microphone 904 provides additional means for the user 900 to invoke one or more verbal commands in order to act upon the selected window 920 under the corresponding icon 922. The window 920 directly underneath the hand-shaped icon is the selected window. When the user hand 918 enters the volume 910, it is recognized as a control object, and the corresponding icon 922 is presented by the system on the computer display 912. By dwelling a predetermined amount of time, the associated window is assigned for control. The user 900 causes the selected window to move up or down within the display by invoking the 'Move' command as explained above and then moving the hand up or down, or to move across one or more of the monitors (914 and 916) by invoking the 'Flick' command and then using the flick hand motion. Of course, if the second display 914 was the initial point of control, the user 900 can cause the window 920 to be moved left to the first display 912, or right to the third display 916. Movement of the window is terminated (or "released") when the user hand dwells for a time longer than a predetermined dwell time, or is moved out of the engagement volume 910.
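A possible way to route a window between adjacent displays on a flick is sketched below. The left-to-right display ordering, and the use of a lateral hand-velocity threshold to classify a flick, are assumptions made for this example rather than details given in the text.

    # Sketch: choose the destination display for one flick gesture.
    DISPLAYS = ["first", "second", "third"]       # left-to-right arrangement
    FLICK_VELOCITY_THRESHOLD = 1.2                # m/s of lateral hand speed (assumed)

    def flick_target(current_display: str, hand_velocity_x: float) -> str:
        """Return the display the window should move to for one flick gesture."""
        if abs(hand_velocity_x) < FLICK_VELOCITY_THRESHOLD:
            return current_display                      # too slow to be a flick
        idx = DISPLAYS.index(current_display)
        step = 1 if hand_velocity_x > 0 else -1         # rightward flick moves right
        return DISPLAYS[max(0, min(len(DISPLAYS) - 1, idx + step))]

    # Two successive rightward flicks move the window first -> second -> third.
    assert flick_target("first", +1.5) == "second"
    assert flick_target("second", +1.5) == "third"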

Alternatively, the user may speak the word "Release" and the system will stop moving the selected window in response to the user's hand motion. Release may also be accomplished by dwelling a bit while in the Move and/or Scroll modes. The user may also act upon a selected window with other actions. By speaking the words "Close", "Minimize", or "Maximize", the selected window is respectively closed, minimized, or maximized. By speaking the word "Raise", the selected window is brought to the foreground, and by speaking "Send to Back", the selected window is sent behind (to the background of) all other open windows. By speaking "Scroll", the user initiates a scrolling mode on the selected window. The user may control the rate of the scroll by the position of the hand. The hand-shaped icon tracks the user's hand position, and the rate of the scrolling of the selected window is proportional to the distance between the current hand icon position and the position of the hand icon at the time the scrolling was initiated. Scrolling can be terminated by the user speaking "Release" or by the user moving their hand behind the engagement plane and out of the engagement volume. These are just a few examples of the voice recognition perceptual computer control capabilities of the disclosed architecture.

Referring now to FIG. 10, there is illustrated a briefing room environment where voice and/or gestures are utilized to control a screen projector via a computer system configured in accordance with an aspect of the present invention. The briefing room 1000 comprises a large briefing table 1002 surrounded on three sides by numerous chairs 1004, a computer 1006, a video projector 1008, and a projector screen 1010. Utilization of the present invention adds additional elements comprising the disclosed perceptual software 1012, two video cameras (1014 and 1016), and a microphone 1018. In this application, a user 1020 is positioned between the projector screen 1010 and the briefing table 1002 at which the audience is seated. A top face 1022 of an engagement volume 1024 is defined by a rectangular area 1026. Similarly, a front surface indicated at 1028 represents an engagement plane.

As the user gives the presentation, the user controls the content displayed on the projection screen 1010 and advancement of the slides (or presentation images) by moving their hand(s) through the engagement plane 1028 into the engagement volume 1024, and/or speaking commands recognizable by the system. Once inside the engagement volume 1024, a simple gesture is made to advance to the next slide, back up to a previous slide, initiate an embedded video, or effect one of a number of many other presentation capabilities.

A similar capability can be implemented for a home media center wherein the user can change selected video sources, change channels, control volume, advance chapters, and perform other similar functions by moving their hand across an engagement plane into an engagement volume and subsequently performing the appropriate hand gesture. Additional applications include perceptual interfaces for TabletPCs, Media Center PCs, kiosks, handheld computers, home appliances, video games, and wall-sized displays, along with many others.

It is appreciated that in more robust implementations, instead of the engagement volume being fixed at a position associated with the location of the cameras, which requires the presenter to operate according to the location of the engagement volume, the system can be configured such that the engagement volume travels with the user (in a "roaming" mode) as the user moves about the room. Thus, the cameras would be mounted on a platform that rotates such that the rotation maintains the cameras substantially equidistant from the presenter. The presenter may carry a sensor (e.g., an RFID tag) that allows the system to sense or track the general location of the presenter. The system would then effect rotation of the camera mount to "point" the cameras at the presenter. In response thereto, the engagement volume may be extended to the presenter, allowing control of the computer system as the presenter moves about. The process of "extending" the engagement volume can include increasing the depth of the volume such that the engagement plane surface moves to the presenter, or maintaining the volume dimensions but moving the fixed volume to the presenter. This would require on-the-fly focal adjustment of the cameras to track not only quick changes in the depth of objects in the volume, but also the movement of the presenter.

Another method of triggering system attention in this mode would be to execute a predefined gesture that is not likely to be made unintentionally, e.g., raising a hand.

It is also appreciated that the system is configurable for individual preferences such that the engagement volume of a first user may be different from the volume of a second user. For example, in accordance with a user login, or other unique user information, the user preferences may be retrieved and implemented automatically by the system. This can include automatically elevating the mounted cameras for a taller person by using a telescoping camera stand so that the cameras are at the appropriate height for the particular user, whether sitting or standing. This also includes, but is not limited to, setting the system for "roaming" mode.

Referring now to FIG. 11, there is illustrated a block diagram of acomputer operable to execute the present invention. In order to provideadditional context for various aspects of the present invention, FIG. 11and the following discussion are intended to provide a brief, generaldescription of a suitable computing environment 1100 in which thevarious aspects of the present invention may be implemented. While theinvention has been described above in the general context ofcomputer-executable instructions that may run on one or more computers,those skilled in the art will recognize that the invention also may beimplemented in combination with other program modules and/or as acombination of hardware and software. Generally, program modules includeroutines, programs, components, data structures, etc., that performparticular tasks or implement particular abstract data types. Moreover,those skilled in the art will appreciate that the inventive methods maybe practiced with other computer system configurations, includingsingle-processor or multiprocessor computer systems, minicomputers,mainframe computers, as well as personal computers, hand-held computingdevices, microprocessor-based or programmable consumer electronics, andthe like, each of which may be operatively coupled to one or moreassociated devices. The illustrated aspects of the invention may also bepracticed in distributed computing environments where certain tasks areperformed by remote processing devices that are linked through acommunications network. In a distributed computing environment, programmodules may be located in both local and remote memory storage devices.

A computer typically includes a variety of computer readable media.Computer readable media can be any available media that can be accessedby the computer and includes both volatile and nonvolatile media,removable and non-removable media. By way of example, and notlimitation, computer readable media can comprise computer storage mediaand communication media. Computer storage media includes both volatileand nonvolatile, removable and non-removable media implemented in anymethod or technology for storage of information such as computerreadable instructions, data structures, program modules or other data.Computer storage media includes, but is not limited to, RAM, ROM,EEPROM, flash memory or other memory technology, CD-ROM, digital videodisk (DVD) or other optical disk storage, magnetic cassettes, magnetictape, magnetic disk storage or other magnetic storage devices, or anyother medium which can be used to store the desired information andwhich can be accessed by the computer. Communication media typicallyembodies computer readable instructions, data structures, programmodules or other data in a modulated data signal such as a carrier waveor other transport mechanism and includes any information deliverymedia. The term “modulated data signal” means a signal that has one ormore of its characteristics set or changed in such a manner as to encodeinformation in the signal. By way of example, and not limitation,communication media includes wired media such as a wired network ordirect-wired connection, and wireless media such as acoustic, RF,infrared and other wireless media. Combinations of the any of the aboveshould also be included within the scope of computer readable media.

With reference again to FIG. 11, the exemplary environment 1100 forimplementing various aspects of the invention includes a computer 1102,the computer 1102 including a processing unit 1104, a system memory1106, and a system bus 1108. The system bus 1108 couples systemcomponents including, but not limited to the system memory 1106 to theprocessing unit 1104. The processing unit 1104 may be any of variouscommercially available processors. Dual microprocessors and othermulti-processor architectures also can be employed as the processingunit 1104.

The system bus 1108 can be any of several types of bus structureincluding a memory bus or memory controller, a peripheral bus, and alocal bus using any of a variety of commercially available busarchitectures. The system memory 1106 includes read only memory (ROM)1110 and random access memory (RAM) 1112. A basic input/output system(BIOS), containing the basic routines that help to transfer informationbetween elements within the computer 1102, such as during start-up, isstored in the ROM 1110.

The computer 1102 further includes a hard disk drive 1114, a magneticdisk drive 1116, (e.g., to read from or write to a removable disk 1118)and an optical disk drive 1120, (e.g., reading a CD-ROM disk 1122 or toread from or write to other optical media). The hard disk drive 1114,magnetic disk drive 1116 and optical disk drive 1120 can be connected tothe system bus 1108 by a hard disk drive interface 1124, a magnetic diskdrive interface 1126 and an optical drive interface 1128, respectively.The drives and their associated computer-readable media providenonvolatile storage of data, data structures, computer-executableinstructions, and so forth. For the computer 1102, the drives and mediaaccommodate the storage of broadcast programming in a suitable digitalformat. Although the description of computer-readable media above refersto a hard disk, a removable magnetic disk and a CD, it should beappreciated by those skilled in the art that other types of media whichare readable by a computer, such as zip drives, magnetic cassettes,flash memory cards, digital video disks, cartridges, and the like, mayalso be used in the exemplary operating environment, and further thatany such media may contain computer-executable instructions forperforming the methods of the present invention.

A number of program modules can be stored in the drives and RAM 1112,including an operating system 1130, one or more application programs1132, other program modules 1134 and program data 1136. It isappreciated that the present invention can be implemented with variouscommercially available operating systems or combinations of operatingsystems.

A user can enter commands and information into the computer 1102 through a keyboard 1138 and a pointing device, such as a mouse 1140. Other input devices (not shown) may include one or more video cameras, one or more microphones, an IR remote control, a joystick, a game pad, a satellite dish, a scanner, or the like. These and other input devices are often connected to the processing unit 1104 through a serial port interface 1142 that is coupled to the system bus 1108, but may be connected by other interfaces, such as a parallel port, a game port, an IEEE 1394 serial port, a universal serial bus ("USB"), an IR interface, etc. A monitor 1144 or other type of display device is also connected to the system bus 1108 via an interface, such as a video adapter 1146. In addition to the monitor 1144, a computer typically includes other peripheral output devices (not shown), such as speakers, printers, etc.

The computer 1102 may operate in a networked environment using logicalconnections to one or more remote computers, such as a remotecomputer(s) 1148. The remote computer(s) 1148 may be a workstation, aserver computer, a router, a personal computer, portable computer,microprocessor-based entertainment appliance, a peer device or othercommon network node, and typically includes many or all of the elementsdescribed relative to the computer 1102, although, for purposes ofbrevity, only a memory storage device 1150 is illustrated. The logicalconnections depicted include a LAN 1152 and a WAN 1154. Such networkingenvironments are commonplace in offices, enterprise-wide computernetworks, intranets and the Internet.

When used in a LAN networking environment, the computer 1102 isconnected to the local network 1152 through a network interface oradapter 1156. When used in a WAN networking environment, the computer1102 typically includes a modem 1158, or is connected to acommunications server on the LAN, or has other means for establishingcommunications over the WAN 1154, such as the Internet. The modem 1158,which may be internal or external, is connected to the system bus 1108via the serial port interface 1142. In a networked environment, programmodules depicted relative to the computer 1102, or portions thereof, maybe stored in the remote memory storage device 1150. It will beappreciated that the network connections shown are exemplary and othermeans of establishing a communications link between the computers may beused.

Further, a camera 1160 (such as a digital/electronic still or videocamera, or film/photographic scanner) capable of capturing a sequence ofimages 1162 can also be included as an input device to the computer1102. While just one camera 1160 is depicted, multiple cameras 1160could be included as input devices to the computer 1102. The images 1162from the one or more cameras 1160 are input into the computer 1102 viaan appropriate camera interface 1164. This interface 1164 is connectedto the system bus 1108, thereby allowing the images 1162 to be routed toand stored in the RAM 1112, or one of the other data storage devicesassociated with the computer 1102. However, it is noted that image datacan be input into the computer 1102 from any of the aforementionedcomputer-readable media as well, without requiring the use of the camera1160.

Referring now to FIG. 12, there is illustrated a network implementation1200 of the present invention. The implementation 1200 includes a firstperceptual system 1202 and a second perceptual system 1204, bothoperational according to the disclosed invention. The first system 1202includes cameras 1206 (also denoted C1 and C2) mounted on a rotationaland telescoping camera mount 1208. A first user 1210 located generallyin front of the first system 1202 effects control of a GUI Content A ofthe first system 1202 in accordance with the novel aspects of thepresent invention by introducing hand gestures into an engagement volume1211 and/or voice signals via a microphone. The first user 1210 may roamabout in front of the cameras 1206 in accordance with the “roaming”operational mode described previously, or may be seated in front of thecameras 1206. The second system 1204 includes cameras 1212 (also denotedC3 and C4) mounted on a rotational and telescoping camera mount 1214. Asecond user 1216 located generally in front of the second system 1204effects control of a GUI Content B of the second system 1204 inaccordance with the novel aspects of the present invention byintroducing hand gestures into an engagement volume 1217 and/or voicesignals using a microphone. The second user 1216 may roam about in frontof the cameras 1212 in accordance with the “roaming” operational modedescribed previously, or may be seated in front of the cameras 1212.

The first and second systems (1202 and 1204) may be networked in aconventional wired or wireless network 1207 peer configuration (or busconfiguration by using a hub 1215). This particular system 1200 isemployed to present both Content A and Content B via a single largemonitor or display 1218. Thus the monitor 1218 can be driven by eitherof the systems (1202 and 1204), as can be provided by conventionaldual-output video graphics cards, or the separate video information maybe transmitted to a third monitor control system 1220 to present thecontent. Such an implementation finds application where a side-by-sidecomparison of product features is being presented, or other similarapplications where two or more users may desire to interact. Thus,Content A and Content B can be presented on a split screen layout of themonitor 1218. Either or both users (1210 and 1216) can also providekeyboard and/or mouse input to facilitate control according to thepresent invention.

3-D Imaging Implementations

Referring now to FIG. 13, there is illustrated a medical operating room system 1300 that uses the engagement volume in accordance with the present invention. An operating room 1302 includes an operating table 1304 on which a patient 1305 is placed. A doctor (or medical person) 1306 is positioned to one side of the table 1304 in order to effectively operate on the patient 1305. However, it is to be appreciated that the medical person 1306 may be required to move around the table 1304 and operate from various positions and angles.

The operating room system 1300 also includes an operation computer system 1308 used by the medical person 1306 to facilitate the operation. In this particular embodiment, the operation computer system 1308 comprises three computer systems: a first computer system 1310, a second computer system 1312, and a third computer system 1314. The first system 1310 includes a first monitor (or display) 1316, the second system 1312 includes a second display 1318, and the third system 1314 includes a third display 1320. Medical information related to the patient 1305 can be displayed on any one or more of the monitors (1316, 1318, and 1320) before, during, and/or after the operation. Note that the computers and displays can be oriented or positioned in any manner suitable for easy use and viewing by operating room personnel.

The operation computing system 1308 also includes at least a pair of cameras 1322 suitably designed for capturing images of at least the hands, arms, head, and general upper torso appendage positions, down to the level of hand and finger positions of the medical person 1306. The cameras 1322 can be connected to a single computer system for the input of image data, and thereafter the image data can be distributed among the computing systems (1310, 1312, and 1314) for processing. The three computer systems (1310, 1312, and 1314) are networked on a wired network 1324, which network 1324 can connect to a larger hospital or facility-wide network, for example. Note that three computer systems are not required; the disclosed invention may employ more or fewer computers. Alternatively, in environments where the network 1324 can present a bottleneck to such data transfers, a gigabit or faster network can be employed internally and locally for high-speed communication of the image data between the computer systems (1310, 1312, and 1314), or to a fourth computer system (not shown) on the local high-speed network that can more efficiently and quickly process and present the image data to any one or more of the displays (1316, 1318, and 1320). This indicates that the system can employ a plurality of computers for presenting the same information from multiple perspectives (as could be beneficial in an operating room environment), or different information from each system, for example.

In one implementation, the operation computing system 1308 develops an engagement volume 1326 above the operating table 1304, which volume envelops part or all of the operation area of the patient 1305. Thus, the table 1304, patient 1305, and volume 1326 are all at a height suitable for operation such that the hands of the medical person 1306 can engage the volume 1326 at an appropriate height to be detected and tracked by the computing system 1308. Hand gestures of the medical person 1306 are then imaged, tracked, and processed, as described hereinabove, and more specifically with respect to FIG. 9, to facilitate controlling the presentation of information on one or more of the displays (1316, 1318, and 1320) via the associated computing systems, which can also entail audio I/O. In support of voice commands, the medical person 1306 can be outfitted with a wireless portable microphone system 1328 that includes a power supply, microphone, and transmitter for communicating wirelessly with a computer wireless transceiver system 1330 of the operation computer system 1308. Thus, voice commands alone or in combination with hand gestures can be used to facilitate the operation.

Referring now to FIG. 14, there is illustrated a medical operating room environment 1400 in which a computer control system 1402 with a wireless control device 1404 is employed. The system 1402 includes the use of the wireless control device 1404 for control thereof. Here, the engagement volume of FIG. 13 is no longer required, or is used only marginally. Continuing with the operating room implementation, the medical person 1306 uses the wireless remote control user interface (UI) device 1404 (hereinafter referred to as a "wand") to facilitate control of the operation computer system 1308. The wand 1404 can be positioned on a headpiece 1406 worn by the medical person 1306 to provide free use of the hands during the procedure. The wand 1404 is oriented in parallel with the line of sight (also called the "gaze") of the person 1306 such that when the person's line of sight is to the system 1308, this is detected as an interaction to be processed by the system 1308. All the person 1306 needs to do is perform head movements to facilitate control of the operation computing system 1308. The wand 1404 includes one or more sensors, the outputs of which are transmitted to the transceiver system 1330 and forwarded to the operation computing system 1308 for processing. The wand 1404 and associated computing system and imaging capabilities are described in the following pending U.S. patent applications: Ser. No. 10/160,692, entitled "A SYSTEM AND PROCESS FOR SELECTING OBJECTS IN A UBIQUITOUS COMPUTING ENVIRONMENT," filed May 31, 2002, and Ser. No. 10/160,659, entitled "A SYSTEM AND PROCESS FOR CONTROLLING ELECTRONIC COMPONENTS IN A UBIQUITOUS COMPUTING ENVIRONMENT USING MULTIMODAL INTEGRATION," filed May 31, 2002, both of which are hereby incorporated by reference.

In general, the system 1402 includes the aforementioned wand 1404 in the form of a wireless radio frequency (RF) pointer, which includes an RF transceiver and various orientation sensors. The outputs of the sensors are periodically packaged as orientation signals and transmitted using the RF transceiver to the computer transceiver 1330, which also has an RF transceiver to receive the orientation messages transmitted by the wand 1404. The orientation signals of the wand 1404 are forwarded to the computer system 1308. The computer system 1308 is employed to compute the orientation and location of the wand 1404 using the orientation signals, as are images of the wand 1404 captured by the cameras 1322. The orientation and location of the wand 1404 are in turn used to determine whether the wand 1404 is being pointed at an object in the operating room environment 1400 that is controllable by the computer system 1308 via the network 1324, such as one of the displays (1316, 1318, or 1320). If so, the object is selected.

The wand 1404 specifically includes a case having a shape with a defined pointing end, a microcontroller, the aforementioned RF transceiver and orientation sensors which are connected to the microcontroller, and a power supply (e.g., batteries) for powering these electronic components. The orientation sensors of the wand 1404 include at least an accelerometer, which provides separate x-axis and y-axis orientation signals, and a magnetometer, which provides separate tri-axial (x-axis, y-axis, and z-axis) orientation signals. These electronics are housed in a case that resembles a handheld wand. However, the packaging can be of any form factor such that the functionality of the wand 1404 can be used for the particular purpose.

As indicated previously, the orientation signals generated by the wand 1404 include the outputs of the sensors. To this end, the wand microcontroller periodically reads and stores the outputs of the orientation sensors. Whenever a request for an orientation signal is received (or it is time to generate such a signal if the pointer is programmed to do so without a request), the microcontroller includes the last-read outputs from the accelerometer and magnetometer in the orientation signal.

The wand 1404 also includes other electronic components such as a user-activated switch or button and a series of light emitting diodes (LEDs). The user-activated switch, which is also connected to the microcontroller, is employed for the purpose of instructing the computer to implement a particular function, as will be described later. To this end, the state of the switch, in regard to whether it is activated or deactivated at the time an orientation message is packaged, is included in that message for transmission to the computer. The series of LEDs includes a pair of differently colored, visible-spectrum LEDs, which are connected to the microcontroller and which are visible from the outside of the pointer's case when lit. These LEDs are used to provide status or feedback information to the user, and are controlled via instructions transmitted to the pointer by the computer.

However, as will be described in greater detail hereinbelow, since the wand 1404 includes at least one motion sensor, the user-activated switch can be implemented in an alternative manner using hands-free control via head movements, for example, or a combination of voice activation and/or head movement, just to name a few.

The foregoing system 1402 is utilized to select an object by having the user simply point to the object or feature with the wand 1404. This entails the computer system 1308 first receiving the orientation signals transmitted by the wand 1404. For each message received, the computer 1308 derives the orientation of the wand 1404 in relation to a predefined coordinate system of the environment in which the wand 1404 is operating, using the orientation sensor readings contained in the message. In addition, the video output from the video cameras 1322 is used to ascertain the location of the wand 1404 at a time substantially contemporaneous with the generation of the orientation signals, in terms of the predefined coordinate system. Once the orientation and location of the wand 1404 are computed, they are used to determine whether the wand 1404 is being pointed at an object in the environment that is controllable by the computer system 1308. If so, then that object is selected for future control actions.

The computer system 1308 derives the orientation of the wand 1404 from the orientation sensor readings contained in the orientation signals, as follows. First, the accelerometer and magnetometer output values contained in the orientation signals are normalized. Angles defining the pitch of the wand 1404 about the x-axis and the roll of the device about the y-axis are computed from the normalized outputs of the accelerometer. The normalized magnetometer output values are then refined using these pitch and roll angles. Next, previously established correction factors for each axis of the magnetometer, which relate the magnetometer outputs to the predefined coordinate system of the environment, are applied to the associated refined and normalized outputs of the magnetometer. The yaw angle of the wand 1404 about the z-axis is computed using the refined magnetometer output values. The computed pitch, roll, and yaw angles are then tentatively designated as defining the orientation of the wand 1404 at the time the orientation signals were generated.
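A generic sketch of this computation is given below: pitch and roll are taken from the gravity vector measured by the accelerometer, and a tilt-compensated yaw is taken from the magnetometer. The code follows a common aerospace-style axis convention (x forward, y right, z down) rather than the wand's exact axis definitions, and it omits the per-axis magnetometer correction factors and the right-side-up/upside-down handling described in the text.

    # Sketch: pitch/roll from accelerometer, tilt-compensated yaw from magnetometer.
    import math

    def orientation_from_sensors(acc, mag):
        """acc, mag: normalized (x, y, z) readings. Returns pitch, roll, yaw in radians."""
        ax, ay, az = acc
        mx, my, mz = mag

        # Pitch and roll from the measured gravity direction.
        roll = math.atan2(ay, az)
        pitch = math.atan2(-ax, math.hypot(ay, az))

        # De-rotate (tilt-compensate) the magnetometer using pitch and roll,
        # then take yaw as the heading in the horizontal plane.
        bfx = (mx * math.cos(pitch)
               + my * math.sin(roll) * math.sin(pitch)
               + mz * math.cos(roll) * math.sin(pitch))
        bfy = my * math.cos(roll) - mz * math.sin(roll)
        yaw = math.atan2(-bfy, bfx)
        return pitch, roll, yaw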

It is next determined whether the wand 1404 was in a right-side-up or upside-down position at the time the orientation signals were generated. If the wand 1404 was in the right-side-up position, the previously computed pitch, roll, and yaw angles are designated as defining the finalized orientation of the wand 1404. However, if it is determined that the wand 1404 was in the upside-down position at the time the orientation message was generated, the tentatively designated roll angle is corrected accordingly, and then the pitch, yaw, and modified roll angle are designated as defining the finalized orientation of the wand 1404.

In the foregoing description, it is assumed that the accelerometer and magnetometer of the wand 1404 are oriented such that their respective first axes correspond to the x-axis, which is directed laterally to a pointing axis of the wand 1404; their respective second axes correspond to the y-axis, which is directed along the pointing axis of the wand 1404; and the third axis of the magnetometer corresponds to the z-axis, which is directed vertically upward when the wand 1404 is positioned right-side up with the x and y axes lying in a horizontal plane.

The computer system 1308 derives the location of the wand 1404 from the video output of the video cameras 1322, as follows. In the wand 1404, there is an infrared (IR) LED connected to the microcontroller that is able to emit IR light outside the wand 1404 case when lit. The microcontroller causes the IR LED to flash. In addition, the aforementioned pair of digital video cameras 1322 each have an IR pass filter that results in the video image frames capturing only IR light emitted or reflected in the environment toward the cameras 1322, including the flashing from the wand 1404 IR LED, which appears as a bright spot in the video image frames. The microcontroller causes the IR LED to flash at a prescribed rate that is approximately one-half the frame rate of the video cameras 1322. This results in only one of each pair of image frames produced by a camera having the IR LED flash depicted in it. This allows each pair of frames produced by a camera to be subtracted to produce a difference image, which depicts for the most part only the IR emissions and reflections directed toward the camera that appear in one or the other of the pair of frames but not both (such as the flash from the IR LED of the pointing device). In this way, the background IR in the environment is attenuated and the IR flash becomes the predominant feature in the difference image.
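The frame-differencing step can be sketched as follows: because the IR LED flashes at roughly half the camera frame rate, it appears in only one frame of each consecutive pair, so subtracting the pair suppresses static background IR and leaves the flash as the dominant feature. NumPy is used here purely for illustration.

    # Sketch: difference of two consecutive IR frames to isolate the LED flash.
    import numpy as np

    def difference_image(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
        """Absolute difference of two consecutive IR frames (uint8 grayscale)."""
        a = frame_a.astype(np.int16)
        b = frame_b.astype(np.int16)
        return np.abs(a - b).astype(np.uint8)   # static background cancels out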

The image coordinates of the pixel in the difference image that exhibits the highest intensity are then identified using a standard peak detection procedure. A conventional stereo imaging technique is employed to compute the 3-D coordinates of the flash for each set of approximately contemporaneous pairs of image frames generated by the pair of cameras 1322, using the image coordinates of the flash from the associated difference images and predetermined intrinsic and extrinsic camera parameters. These coordinates represent the location of the wand 1404 (as represented by the location of the IR LED) at the time the video image frames used to compute the coordinates were generated by the cameras 1322.
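A sketch of locating the flash and triangulating its 3-D position is shown below. The peak detector is a plain argmax, and the triangulation is a standard linear (DLT) method; the projection matrices p1 and p2 stand in for the "predetermined intrinsic and extrinsic camera parameters" and are assumed to be known from calibration.

    # Sketch: find the flash in each difference image and triangulate it in 3-D.
    import numpy as np

    def brightest_pixel(diff_img: np.ndarray):
        """Return (u, v) image coordinates of the highest-intensity pixel."""
        v, u = np.unravel_index(np.argmax(diff_img), diff_img.shape)
        return float(u), float(v)

    def triangulate(p1: np.ndarray, p2: np.ndarray, uv1, uv2) -> np.ndarray:
        """Linear triangulation of one point from two 3x4 projection matrices."""
        (u1, v1), (u2, v2) = uv1, uv2
        a = np.stack([u1 * p1[2] - p1[0],
                      v1 * p1[2] - p1[1],
                      u2 * p2[2] - p2[0],
                      v2 * p2[2] - p2[1]])
        _, _, vt = np.linalg.svd(a)
        x_h = vt[-1]
        return x_h[:3] / x_h[3]   # 3-D wand (IR LED) location in world coordinates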

The orientation and location of the wand 1404 at any given time are used to determine whether the wand 1404 is being pointed at an object in the environment that is controllable by the computer system 1308. In order to do so, the computer system 1308 must know what objects are controllable and where they exist in the environment. This requires a model of the environment. In the present system and process, the location and extent of objects within the environment that are controllable by the computer system 1308 are modeled using 3-D Gaussian blobs, each defined by the location of the mean of the blob in terms of its environmental coordinates and a covariance.

At least two different methods have been developed to model objects in the environment. The first method involves the user inputting information identifying the object that is to be modeled. The user then activates the switch on the pointing device and traces the outline of the object. Meanwhile, the computer system 1308 is running a target training procedure that causes requests for orientation signals to be sent to the wand 1404 at a prescribed request rate. The orientation signals are input when received, and for each orientation signal, it is determined whether the switch state indicator included in the orientation signal indicates that the switch is activated. Whenever it is initially determined that the switch is not activated, the switch state determination is repeated for each subsequent orientation signal received until an orientation signal is received that indicates the switch is activated. At that point, each time it is determined that the switch is activated, the location of the wand 1404 is ascertained, as described previously, using the digital video input from the pair of video cameras 1322. When the user is done tracing the outline of the object being modeled, he or she deactivates the switch. The target training (or calibration) process detects this as the switch having been deactivated after first having been activated in the immediately preceding orientation signal. Whenever such a condition occurs, the tracing procedure is deemed to be complete, and a 3-D Gaussian blob representing the object is established using the previously ascertained wand locations stored during the tracing procedure.
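Turning the wand locations recorded during the trace into a 3-D Gaussian blob can be sketched as computing their mean and covariance. The small regularization term added to the covariance is an assumption for this example, to keep the blob well conditioned when the trace is nearly planar.

    # Sketch: build a Gaussian blob (mean, covariance) from traced wand locations.
    import numpy as np

    def blob_from_trace(wand_locations: np.ndarray, min_variance: float = 1e-4):
        """wand_locations: (N, 3) array of 3-D points recorded while the switch was held.

        Returns (mean, covariance) of the Gaussian blob representing the traced object.
        """
        mean = wand_locations.mean(axis=0)
        cov = np.cov(wand_locations, rowvar=False)
        cov += min_variance * np.eye(3)          # guard against a degenerate trace
        return mean, cov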

The second method of modeling objects during a calibration process once again begins with the user inputting information identifying the object that is to be modeled. However, in this case the user repeatedly points the wand 1404 at the object and momentarily activates the switch on the wand 1404, each time pointing the wand 1404 from a different location within the environment. Meanwhile, the computer system 1308 is running a target training algorithm that causes requests for orientation signals to be sent to the wand 1404 at a prescribed request rate. Each orientation message received from the wand 1404 is input until the user indicates the target training inputs are complete.

For each orientation signal input, it is determined whether the switch state indicator contained therein indicates that the switch is activated. Whenever it is determined that the switch is activated, the orientation of the wand 1404 is computed, as described previously, using the orientation sensor readings also included in the orientation message. In addition, the location of the wand 1404 is ascertained using the inputted digital video from the pair of video cameras 1322. The computed orientation and location values are stored.

Once the user indicates the target training inputs are complete, the location of the mean of a 3-D Gaussian blob that will be used to represent the object being modeled is computed from the stored orientation and location values of the wand 1404. The covariance of the Gaussian blob is then obtained in one of various ways. For example, it can be a prescribed covariance, a user-input covariance, or the covariance can be computed by adding a minimum covariance to the spread of the intersection points of rays defined by the stored orientation and location values of the wand 1404.

With a Gaussian blob model of the environment in place, the orientation and location of the wand 1404 are used to determine whether the wand 1404 is being pointed at an object in the environment that is controllable by the computer system 1308. In one version of this procedure, for each Gaussian blob in the model, the blob is projected onto a plane that is normal to either a line extending from the location of the wand 1404 to the mean of the blob, or a ray originating at the location of the wand 1404 and extending in a direction defined by the orientation of the wand 1404. The value of the resulting projected Gaussian blob at the point where the ray intersects the plane is computed. This value represents the probability that the wand 1404 is pointing at the object associated with the blob under consideration.
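One plausible reading of this projection-based test is sketched below: each blob's 3-D Gaussian is projected onto a plane through the blob mean and normal to the wand-to-mean line, and the projected density is evaluated where the wand's pointing ray pierces that plane. This is an illustrative interpretation under those assumptions, not a verbatim implementation.

    # Sketch: score one blob by projecting it onto a plane and evaluating the
    # projected 2-D Gaussian at the ray/plane intersection point.
    import numpy as np

    def pointing_probability(wand_pos, wand_dir, blob_mean, blob_cov):
        """All arguments are numpy arrays; wand_dir need not be normalized."""
        d = wand_dir / np.linalg.norm(wand_dir)
        n = blob_mean - wand_pos
        n = n / np.linalg.norm(n)                    # plane normal (wand-to-mean line)

        # Intersect the pointing ray with the plane through blob_mean, normal n.
        denom = float(n @ d)
        if denom <= 1e-9:                            # pointing away from the blob
            return 0.0
        t = float(n @ (blob_mean - wand_pos)) / denom
        hit = wand_pos + t * d

        # Build an orthonormal basis (e1, e2) for the plane and project.
        e1 = np.cross(n, [0.0, 0.0, 1.0])
        if np.linalg.norm(e1) < 1e-6:                # normal nearly vertical
            e1 = np.cross(n, [0.0, 1.0, 0.0])
        e1 = e1 / np.linalg.norm(e1)
        e2 = np.cross(n, e1)
        basis = np.stack([e1, e2], axis=1)           # 3x2

        cov_2d = basis.T @ blob_cov @ basis          # projected covariance
        x = basis.T @ (hit - blob_mean)              # hit point in plane coordinates

        # 2-D Gaussian density at the intersection point.
        inv = np.linalg.inv(cov_2d)
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov_2d)))
        return float(norm * np.exp(-0.5 * x @ inv @ x))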

Next, the largest of the probability values computed for the Gaussian blobs, if any, is identified. At this point, the object associated with the Gaussian blob from which the largest probability value was derived could be designated as the object at which the wand 1404 is pointing. However, an alternative thresholding procedure could be employed instead. In this alternate version, it is first determined whether the largest probability value exceeds a prescribed minimum probability threshold. Only if the threshold is exceeded is the object associated with the projected Gaussian blob from which the largest probability value was derived designated as the object at which the wand 1404 is pointing. The minimum probability threshold is chosen to ensure the user is actually pointing at the object and not merely near the object without an intent to select it.

In an alternate procedure for determining whether the wand 1404 is being pointed at an object in the environment 1400 that is controllable by the computer system 1308, it is determined, for each Gaussian blob, whether a ray originating at the location of the wand 1404 and extending in a direction defined by the orientation of the wand 1404 intersects the blob. Next, for each Gaussian blob intersected by the ray, the value of the Gaussian blob is computed at the point along the ray nearest the location of the mean of the blob. This value represents the probability that the wand 1404 is pointing at the object associated with the Gaussian blob. The rest of the procedure is similar to the first method, in that the object associated with the Gaussian blob from which the largest probability value was derived could be designated as the object at which the wand 1404 is pointing. Alternatively, it is first determined whether the probability value identified as the largest exceeds the prescribed minimum probability threshold; only if the threshold is exceeded is the object associated with the Gaussian blob from which the largest probability value was derived designated as the object at which the wand 1404 is pointing.
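The alternate test and the thresholded selection can be sketched together: each blob is scored by the Gaussian density at the point on the pointing ray closest to the blob mean, and the best-scoring object is selected only if its score clears a minimum probability threshold. The threshold value used below is an assumption for the example.

    # Sketch: nearest-point-on-ray scoring plus thresholded object selection.
    import numpy as np

    def ray_score(wand_pos, wand_dir, blob_mean, blob_cov):
        """Gaussian density at the point on the ray nearest the blob mean."""
        d = wand_dir / np.linalg.norm(wand_dir)
        t = max(0.0, float(d @ (blob_mean - wand_pos)))   # clamp: a ray, not a line
        nearest = wand_pos + t * d
        diff = nearest - blob_mean
        inv = np.linalg.inv(blob_cov)
        norm = 1.0 / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(blob_cov))
        return float(norm * np.exp(-0.5 * diff @ inv @ diff))

    def select_object(wand_pos, wand_dir, blobs, min_probability=1e-3):
        """blobs: dict of object name -> (mean, covariance). Returns a name or None."""
        scores = {name: ray_score(wand_pos, wand_dir, m, c) for name, (m, c) in blobs.items()}
        best = max(scores, key=scores.get) if scores else None
        if best is not None and scores[best] >= min_probability:
            return best        # object deemed to be pointed at
        return None            # user not clearly pointing at anything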

Hands-free control of the operation computing system 1308 using the head-mounted wand 1404 involves generating at least a series of calibrated head movements. Moreover, since the person 1306 also uses the wireless microphone system 1328, voice commands can be employed alone or in combination with the head movements to enhance control of the computer system 1308. With the implementation of one or more motion sensors therein, e.g., accelerometers, velocity and/or acceleration data can be measured and resolved as the "switch" signal of the wand 1404 to initiate or terminate an action without physically having to move a switch with a finger, which would be extremely cumbersome and risky (at least insofar as sterilization and the transmission of germs are involved) in an operating room environment. For example, when the system 1308 determines that the gaze of the medical person 1306 is at the second display 1318, a simple left-right head movement can be interpreted to initiate a paging action such that displayed images are changed, similar to a person thumbing through pages of a book. Thereafter, an up-down head nod could be used to stop the paging process. Alternatively, the paging process could be initiated by voice command after the system 1308 ascertains that the gaze is directed at the second display 1318.

If more than one wand 1404 is employed by operating room personnel, the wands can be uniquely identified by an RF tagging system, such that signals transmitted to the computer system 1308 are interpreted in association with different personnel. For example, the doctor in charge of the operation and his or her assisting nurse could each have a head-mounted wand. The system 1308 can be suitably designed to discriminate the wand signals according to a unique tag ID that accompanies each signal transmitted to the computer system 1308. Such a tagging system can also be used as a method of prioritizing signals for controlling the computer. For example, the system can be configured to prioritize signals received from the doctor over those signals received from the assisting nurse.

In a more sophisticated implementation, the computer system 1308 employs the classifier system described hereinabove to learn the movements of personnel over time. For example, the body movements one person uses to control the system 1308 typically differ from those of another person. Thus, instead of the user of the wand 1404 conforming to rigid signaling criteria required by the computer system algorithm, the system 1308 can employ the classifier to learn the particular movements of a given user. Once the user "logs in" to the system 1308, these customized movement signals (and voice signals, for example) can then be activated for use by the system 1308 for that user.

It is to be appreciated that once a remote wireless system 1404 is employed, other internal and external signals can be input thereto for transmission to and control of the system 1308. For example, the heart rate of the person 1306 can be monitored and input to the wand system 1404 or the wireless voice system 1328 for wireless input to the system 1308, in order to monitor the state of the person 1306. If, during a lengthy operation, the system 1308 detects that the physical condition of the person 1306 is deteriorating, the classifier can be used to modify how the movement and voice signals are processed for controlling the system 1308. A faster heart rate can indicate faster speech and/or head movements that would then be compensated for in the system 1308 using the classifier. Of course, these parameters would be determined on a user-by-user basis.

In accordance with the orientation signals received from the wand 1404, the system 1308 can determine a number of factors about the person 1306. The system 1308 can determine when the person 1306 (or which person(s)) is looking at the system 1308. For example, if the orientation of the wand 1404 indicates that the head position (or gaze) of the person 1306 matches an orientation associated with looking at any of the three monitors (1316, 1318, or 1320), here the second monitor 1318, the system 1308 then responds according to signals received thereafter, until the viewing session associated with the second monitor 1318 is terminated by voice and/or head movements.

Where only one wand 1404 is provided, the system 1308 can re-associate the wand 1404 with a user profile of another person 1306 who will use the wand 1404. There exists a database of user profiles and tag associations such that invocation of the wand tag (or ID) with the user log-in name automatically executes the user profile for use with the wand 1404. In this way, individualized user commands in the form of head movements, voice commands, etc., are automatically invoked at the user log-in process.

The system 1308 can also employ a bi-directional activation scheme wherein the user initiates a user command for starting a session, and the system 1308 responds with a signal that further requires a user response to confirm that a session is to begin. For example, the person 1306 can initiate a session by motioning an up-down head nod repeatedly for three cycles. The system 1308 receives the corresponding three cycles of up-down nod signals, which are interpreted to start a session for that person 1306. In order to ensure that the head nod was not made inadvertently, the system 1308 responds by presenting an image on the first display 1316, at which the person 1306 must point the wand 1404 to confirm the start of the session. Of course, other signals can be used to confirm session start. For example, the user can look to the ceiling, with the resulting orientation of the wand 1404 in a substantially vertical direction interpreted to confirm the start of a session. Obviously, the number and combination of head movements and/or voice commands that can be employed in the present system are numerous, and can be used in accordance with user preferences.

In the system 1402, the transceiver system 1330 can be used for wireless communication for both the wand system 1404 and the voice communications system 1328. Thus, the wand link can be of one frequency, and the voice communication link of another frequency. The computer system 1308 is configured to accommodate both by providing frequency discrimination and processing so that signal streams can be filtered and processed to extract the corresponding wand and voice signals.

Referring now to FIG. 15, there is illustrated a flowchart of a process from the perspective of the person for using the system of FIG. 14. At 1500, the user performs a calibration process that comprises associating a number of head movements and/or voice commands with user commands. This also includes using voice commands singly to control the operation computing system or in combination with the head movements to do so. The calibration process can be performed well in advance of use in the operating room, and updated as the user chooses to change the movements and/or voice signals associated with user commands. At 1502, the person initiates a session with the computer system using one or more user commands. At 1504, the person then inputs one or more of the user commands to control the computing system. At 1506, the person terminates the session using one or more of the user commands. The process then reaches a Stop block.

Referring now to FIG. 16, there is illustrated a flowchart of a process from the perspective of the system of FIG. 14. At 1600, the calibration process occurs where the system associates wand device signals and/or voice signals with user commands specified by the person. The calibration process ends. At 1602, the user wand signals are received and processed by the system. At 1604, the system determines if the processed user command(s) indicate that a session is to be started with that user. If NO, flow is back to the input of 1602 to continue processing received device and voice signals. If YES, flow is to 1606 to identify the user. This can occur by the system processing the received signal and extracting the device tag ID. Prior to use, the tag ID of the wand is programmed for association with a given user. At 1608, the user profile of calibration data is activated for use. Of course, for any given operation, the operating staff are assigned such that the log-in names of the doctors and assistants can be entered prior to beginning the operation. Thus, the user profiles are already activated for processing.

At 1610, the device signals are received and processed to determine the tag ID of the device and to process the user command(s) against the associated profile information to enable the related command. At 1612, where a classifier is employed, the classifier tracks, processes, compares, and updates the user profile when wand movements associated with the particular user command are changed within certain criteria. At 1614, the computer system determines if the session has completed. If NO, flow is back to the input of 1610 to continue to process user commands. If YES, flow is to 1616 to terminate the user session. The process then reaches a Stop block. Of course, from 1616, flow can be brought back to the input of 1602 to continue to process signals from other devices or to prepare for another session, which could occur many times during the operating room event.
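
Purely as an illustration of the loop structure described above, the following sketch extracts a tag ID from each received packet, resolves the pre-activated profile, translates a movement label into a user command, and stops when a terminate command is seen; the packet fields, tag ID, movement labels, and command names are all invented for this example.

# Hypothetical per-wand profiles mapping movement signals to user commands.
profiles = {
    "WAND-01": {
        "user": "dr_smith",
        "tilt_left": "previous_image",
        "tilt_right": "next_image",
        "double_nod": "terminate_session",
    },
}

def run_session(packets):
    for packet in packets:                  # step 1610: receive and process device signals
        profile = profiles.get(packet["tag_id"])
        if profile is None:
            continue                        # unknown device, ignore
        command = profile.get(packet["movement"])
        if command is None:
            continue                        # unmapped movement, ignore
        print(profile["user"], "->", command)
        if command == "terminate_session":  # steps 1614/1616: end the session
            break

run_session([
    {"tag_id": "WAND-01", "movement": "tilt_right"},
    {"tag_id": "WAND-01", "movement": "double_nod"},
])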

Referring now to FIG. 17, there is illustrated a medical environment 1700 in which a 3-D imaging computer control system 1702 is employed to process hand (or body) gestures in accordance with the present invention. The operation computing system 1308 provides 3-D image recognition and processing capability such that the engagement volume of FIG. 13 and the wand 1404 of FIG. 14 are no longer required. The system 1702 can be augmented with voice commands in a manner similar to that described above; however, this is not needed. Audio-visual co-analysis can be used to improve continuous gesture recognition. Here, the transceiver system 1330 is used only for wireless voice communication, when vocalization is employed. For example, the medical person 1306 can simply use the system 1308 as a dictaphone to record voice signals during the operation.

The foregoing system 1702 is used by the computer system 1308 to select an object under computer control in the environment by having the user simply make one or more hand gestures. Of course, this can be done using both hands, which feature will be described in greater detail hereinbelow. This entails the computer system 1308 capturing imaging information about the hand gesture(s), and for each image or series of images received, the computer system 1308 derives the posture, orientation, and location of the hand, pair of hands, or any combination of one or more hands and any other body part (e.g., the head) (hereinafter grouped and denoted generally as “gesture characteristics”, or where specifically related to a hand, as “hand gesture characteristics” or “hand characteristics”) in relation to a predefined coordinate system of the environment in which the gesture is employed. Gesture analysis involves tracking the user's hand(s) in real time. Hidden Markov Models (HMMs) can be employed for recognition of continuous gesture kinematics. In addition, the video output from the video cameras 1322 is used to ascertain the gesture characteristics at a time substantially contemporaneous with the generation of the gesture and in terms of the predefined coordinate system. Once the gesture characteristics are processed, they are used to determine whether an object in the environment should be controlled by the computer system 1308. If so, then that object is selected for future control actions. Moreover, stochastic tools such as Kalman filtering can be used to predict the position of the hand or hands in subsequent image frames.
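
The following is a minimal constant-velocity Kalman filter sketch of the kind of position prediction mentioned above, not the particular filter of the specification; the frame interval, noise covariances, and normalized image coordinates are assumed values chosen only to make the example run.

import numpy as np

dt = 1.0 / 30.0                              # assumed camera frame interval
F = np.array([[1, 0, dt, 0],                 # state transition over [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                  # only x, y are measured
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-3                         # process noise (assumed)
R = np.eye(2) * 1e-2                         # measurement noise (assumed)

x = np.zeros((4, 1))                         # initial state estimate
P = np.eye(4)                                # initial state covariance

def step(z_xy):
    """One predict/update cycle; z_xy is the measured hand centroid."""
    global x, P
    # Predict the state forward one frame.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new measurement.
    z = np.asarray(z_xy, dtype=float).reshape(2, 1)
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    # Return the predicted hand position for the next frame.
    return (F @ x)[:2].ravel()

print(step([0.40, 0.55]))                    # e.g., normalized image coordinates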

Note that an object includes an object or device external to the computer system 1308 and controllable by a wireless and/or wired connection, as well as any internal device or feature that comprises software programs that are used to display images, manipulate data, and move data from one location to another, for example.

The process begins by generating a model of the environment. This process includes, but is not limited to, defining what aspects of the environment will be controlled by the computer system 1308, such as lights, lighting level, room temperature, operating room life support machines and other computer controlled machines in the room, and software controls that will be required or desired of the system 1308 before, during, and/or after the procedure. The software controls comprise the gestures required to initiate image paging, image rotation about a vertex, image rotation about an axis, zooming in and out on an image, providing supplementary data (e.g., video and audio) related to an image being presented or manipulated in a certain way, performing x,y translations of the image, stepped rotation, changing user interface coloring to improve visibility of an image, changing image contrast, changing resolution of an image, playing a series of images quickly or slowly (looping speed), freezing and unfreezing a looping image video (of, for example, echocardiography, transverse CT (Computed Tomography) and cryosection images, CT output, and a fly-through of MRI data), initiating repetitive image(s) playback (looping), jumping from the first monitor 1316 to another monitor (1318 or 1320), and adjusting audio controls when listening to audio data (e.g., EKG) during the procedure.
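
An environment model of this kind could be represented as simple configuration data; the sketch below is one hypothetical representation in which the object names and gesture labels are placeholders standing in for the recognizer's actual vocabulary, not values defined by the specification.

# Illustrative environment model: controllable objects plus a
# gesture-to-software-control mapping.
environment_model = {
    "objects": [
        "lights", "room_temperature", "life_support_monitor",
        "display_1316", "display_1318", "display_1320",
    ],
    "software_controls": {
        "open_hand":            "zoom_in",
        "thumb_pinky_extended": "zoom_out",
        "two_fingers_raised":   "rotate_about_axis",
        "thumb_index_pose":     "rotate_about_vertex",
        "pinky_pose":           "video_start_stop_loop",
    },
}

def control_for(gesture):
    """Look up the software control associated with a recognized gesture."""
    return environment_model["software_controls"].get(gesture)

print(control_for("open_hand"))   # -> "zoom_in"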

The next step is to calibrate the model according to the persons who will be working in the environment and interacting with the system 1308. Unique user profiles can be generated for each person interacting with the system 1308 by employing a tagging system that can discriminate the various users. This can be accomplished in several ways. One method provides a unique RF tag to each user. A triangulation system can be utilized to continually monitor the location of a given user, and associate the location data with the captured image data such that gestures from that location will be processed against that user's profile to properly execute the user command.
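
As a hedged illustration of how such location data might be obtained, the sketch below performs a basic 2-D trilateration from three fixed receivers and then assigns the gesture to the nearest user zone; the receiver coordinates, range values, and zone centers are all made-up numbers used only to show the geometry.

import numpy as np

def trilaterate(p1, p2, p3, r1, r2, r3):
    """Return the (x, y) tag position from three receiver positions and ranges."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Subtracting the circle equations pairwise yields a 2x2 linear system.
    A = np.array([[2 * (x2 - x1), 2 * (y2 - y1)],
                  [2 * (x3 - x1), 2 * (y3 - y1)]], dtype=float)
    b = np.array([r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2,
                  r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2], dtype=float)
    return tuple(np.linalg.solve(A, b))

# Hypothetical zone centers; whichever zone the tag is nearest to owns the gesture.
zones = {"surgeon": (1.0, 2.0), "assistant": (3.5, 2.0)}

pos = trilaterate((0, 0), (5, 0), (0, 4), r1=2.24, r2=4.47, r3=2.24)
user = min(zones, key=lambda u: np.hypot(pos[0] - zones[u][0],
                                         pos[1] - zones[u][1]))
print(pos, "->", user)   # roughly (1.0, 2.0) -> surgeon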

Another method is to employ several camera sets, where each set is dedicated to a specific user or location in which the user will work. The user could also be clothed in a colored uniform, where the combination of color, gesture, and location uniquely identifies the command issued by that user to the system 1308. As mentioned hereinabove, the system 1308 can be programmed to invoke a bi-directional confirmation system such that each user gesture is followed by a confirmation request to ensure that the proper user command is issued. Feedback can be provided by displaying the command in large text or outputting the anticipated command in audio to the user, after which the user responds by voice or with another gesture to accept or reject the command.

The imaging system 1308 detects gesture object (or hand) depth or distance from the system 1308 to facilitate discriminating between a plurality of gesture sources. The gesture sources can include a single hand, two hands, one hand from each of two people, etc. RF triangulation techniques can be used to accurately determine the gesture source(s). Thus, the gesture source includes an RF tag. If two hands are being used in gesticulation, then each hand can include a unique RF tag. Other depth determination systems can be employed to accurately discriminate the gesture sources, such as infrared.

As with other implementations mentioned above, the environment needs to be modeled for all objects to be controlled or interacted with, including both hardware and software. The gestures are then defined and associated with the objects. This can further include the use of voice commands and, where the wireless remote device is worn in alignment with the person's line-of-sight, the additional combination of “gaze” signals, where the gaze signals are defined as those wireless device (or wand) signals generated when the person looks in a direction to effect object interaction.

The system 1308 can also be configured to determine when the operator is generally facing the system 1308. A facial image can be captured and processed, with facial features generally providing the data needed to make such an automatic determination. Another method includes wearing a multi-colored uniform such that one color is associated with the person facing the system 1308, while another imaged color indicates the person is not facing the system 1308. Still another method employs a reflective surface on the front of the person such that the presence of reflective signals indicates the person is facing the system 1308.

The system 1308 is capable of determining when one person programmed to interact therewith has been replaced by another. This causes an automatic change of user profiles to enable the present user's gestures for corresponding user commands and control of the system 1308. Again, this can be facilitated by a color scheme whereby each medical person is uniquely identified in the system 1308 with a unique color. Any sort of tag-identification system could be used, as well. Of course, voice commands can also be used to facilitate personnel replacements in the medical environment.

Image processing demands, especially for 3-D imaging, can place an enormous burden on the operating computer system 1308. As mentioned hereinabove, the system 1308 can be distributed across two or more computers as a multi-computer system to supply the processing power for 3-D image processing. The disclosed imaging system software can then be distributed across the multi-computer system for the exchange of data needed for ultimately making decisions for human-machine interaction.

The system 1308 can also employ a bi-directional interaction scheme to confirm selection of all gesture and gesture/voice actions. For example, if the user initiates a user command for starting a session, the system 1308 responds with a signal that further requires a user response to confirm that a session is to begin. The confirmation response can be in the form of a duplicate gesture and/or voice command. Obviously, the number and combination of gestures and voice commands that can be employed singly or in combination in accordance with the present system are numerous.

The system 1308 also includes audio input capabilities such that not only voice signals, but also clicking sounds, pitch-related sounds, and other distinctive audio signals, can be received and processed to further extend the number of inputs for controlling the system 1308. Such alternative inputs can be input through the portable microphone system 1328 worn by at least one medical person in the operating room. Moreover, additional haptics inputs can be employed by providing a suit or vest with various touch or pressure points to augment the number of signals for controlling the system 1308. Thus, the wrist, forearm, and other appendage points can be used to initiate and send signals from the suit through a wireless remote pressure point transmission system, made part of the wireless voice communication system 1328, for example.

Referring now to FIG. 18, there is illustrated a flowchart of a process from the perspective of the person for using the system of FIG. 17. At 1800, the user performs a calibration process that comprises associating (or mapping) a number of gestures in the form of hand poses and movements, head movements, and/or voice commands with user commands. This also includes using voice commands singly to control the operation computing system or in combination with the gestures to do so. The calibration process can be performed well in advance of use in the operating room, and updated as the user chooses to change the movements and/or voice signals associated with user commands. At 1802, the person initiates a session with the computer system using one or more gestures. At 1804, the person then inputs one or more of the user commands using gestures to control operation of the computing system. At 1806, the person terminates the session using one or more of the gestures. The process then reaches a Stop block.

Referring now to FIG. 19, there is illustrated a flowchart of a process from the perspective of the system of FIG. 17. At 1900, the calibration process occurs for a user where the user presents one or more hands, hand poses, and orientations to the imaging system for capture and association with a given user command. The system then maps the images to the user command. This occurs for a number of different commands, and completes the calibration phase for that user. At 1902, the user presents one or more gestures that are captured and processed by the system for user commands. At 1904, the system determines if the processed user command(s) indicate that a session is to be started with that user. If NO, flow is back to the input of 1902 to continue the process of receiving and interpreting gestures and/or voice signals. If YES, flow is to 1906 to identify the user. This can be via a triangulation system that determines the location of the source of the gestures. In one implementation, a glove of the medical person includes an RF device or similar device that is detectable by the system for the purpose of determining the source of the gesture signals. At 1908, the user profile of calibration data is activated for use. Of course, for any given operation, the operating staff are assigned such that the log-in names of the doctors and assistants can be entered prior to beginning the operation. Thus, the user profiles are already activated for processing.

At 1910, the gestures are imaged, received, and processed to execute the corresponding user command(s). At 1912, where a classifier is employed, the classifier tracks, processes gesture images, compares the images, and updates the user gesture characteristics associated with the particular user command. At 1914, the computer system determines if the session has completed. If NO, flow is back to the input of 1910 to continue to process gestures into user commands. If YES, flow is to 1916 to terminate the user session. The process then reaches a Stop block. Of course, from 1916, flow can be brought back to the input of 1902 to continue to process gestures or to prepare for another session, which could occur many times during the operating room event.

Referring now to FIG. 20, there is illustrated a medical environment 2000 in which a 3-D imaging computer control system 2002 is employed with the remote control device 1404 to process hand (or body) gestures and control the system 1308 in accordance with the present invention. The imaging and image processing capabilities of the 3-D imaging system 1308 and the head-mounted wand 1404 can be employed in combination to further enhance the hands-free capabilities of the present invention. Moreover, the wireless vocalization system 1328 can further be used to augment control of the system 1308. As indicated previously, the wand electronics can be repackaged for use in many different ways. For example, the packaging can be such that the wireless system is worn on the wrist, elbow, leg, or foot. The system 1308 can be used to image both the gestures of the person 1306 and the orientation of the wand 1404 to provide more accurate human-machine interaction and control. Each of the systems has been described herein, the details of which are not repeated here for the purpose of brevity. Sample gestures, voice commands, and gaze signals used in the system 2002 are described hereinbelow.

Referring now to FIG. 21A, there are illustrated sample one-handed and two-handed gestures that can be used to control the operation computing system in accordance with the present invention. At 2100, two closed fists (left and right) can be programmed for imaging and interpretation to cause axis control. At 2102, the right hand in a pointing pose can be used in two orientations, a vertical orientation followed by a sideways clockwise rotation, the combination of which can be programmed for imaging and interpretation to tilt a selected axis a predetermined number of degrees, and keep tilting the axis in stepped increments. At 2104, the gestures of 2102 are continued in reverse, where the sideways clockwise rotation is reversed to a counterclockwise rotation followed by the vertical orientation, the combination of which can be programmed for imaging and interpretation to stop axis tilting and maintain the current tilt angle. At 2106, a right-handed two-fingers-raised pose can be used to rotate an image about an existing axis. Note that the image can be x-rays of the patient, MRI (Magnetic Resonance Imaging) frames, etc. At 2108, the thumb and pointing finger pose of the right hand can be used to rotate an image about a vertex point.

Referring now to FIG. 21B, there are illustrated additional sample one-handed gestures and sequenced one-handed gestures that can be used to control the operation computing system in accordance with the present invention. At 2110, an open right hand with fingers tightly aligned can be used to initiate a zoom-in feature such that the zoom-in operation continues until the gesture changes. At 2112, a right hand where the thumb and pinky finger are extended can be used to initiate a zoom-out feature such that the zoom-out operation continues until the gesture changes. At 2114, a sequence of right-hand gestures is used to select an image for x,y translation, and then to translate the image up and to the right by a predefined distance or percentage of available viewing space on the display. Here, the right hand is used to provide an open hand plus closed fist plus open hand, and then move the open hand up and to the right a short distance. This can be recognized and interpreted to perform the stated function of axis translation in an associated direction. At 2116, a sideways pointing pose plus a counterclockwise motion is programmed for interpretation to rotate the object in the horizontal plane. At 2118, the same hand pose plus a circular motion in the opposite direction can be programmed to rotate the object in the vertical plane. Note, however, that the hand pose is arbitrary, in that it may be more intuitive to use a hand pose where one or more of the fingers point upward. Moreover, the gesture itself is also arbitrary, and is programmable according to the particular desires of the user.

Referring now to FIG. 21C, there are illustrated additional sample one-handed gestures that can be used to control the operation computing system in accordance with the present invention. At 2120, a right-handed three-fingers-open pose with the index finger and thumb touching can be used to impose a triaxial grid on a 3-D image. At 2122, a right-handed single pointing-finger pose can be used to select the x-axis; a right-handed two-finger pose can be used to select the y-axis; and a right-handed three-finger pose can be used to select the z-axis. At 2124, a pinky-finger pose can be used to stop, start, and loop videos on the system 1308. It is also possible, using various “structure-from-motion” techniques, to track arbitrary points on the hand and, over time, deduce the change in 3-D orientation of the object in such a way that the user need not adopt some predefined pose. In this case, however, the user can enter the 3-D rotation mode by another method.

At 2126, rotation of the pinky-finger pose in a clockwise direction while facing the system 1308 can be used to control monitor intensity, volume on/off, and amplitude. These are grouped here for brevity, since, for example, the pinky-finger pose and/or rotation can be mapped to any one of the functions described. At 2128, an open hand gesture in a clockwise rotation can be used to rotate an image about an axis according to the speed of movement of the open hand, such that when the hand stops, the axis rotation also stops, and starts when hand movement starts.

Referring now to FIG. 21D, there are illustrated additional sample one-handed gestures used in combination with voice commands that can be used to control the operation computing system in accordance with the present invention. At 2130, the open hand pose plus a voiced “ZOOM” command can be used to zoom in on a displayed image until the gesture changes or a different command is voiced. At 2132, the thumb and pinky finger extended pose plus a voiced “ZOOM” command can be used to zoom out on a displayed image until the gesture changes or a different command is voiced. Depth information can also be used, e.g., moving closer would trigger a zoom-in function. Alternatively, when zoom is invoked, movement in depth can control the zoom value continuously.

At 2134, a left-handed open hand pose in a sideways orientation plus a voiced “MOVE” command can be used to move a selected image to the right until the gesture changes, which stops the movement. At 2136, a right-handed open hand pose in a sideways orientation plus a voiced “MOVE” command can be used to move a selected image to the left until the gesture changes, which stops the movement. At 2138, a closed fist in a circular motion in combination with a “LOUD” voice command can be used to turn audio volume on/off, and control the amplitude during the procedure to listen to the patient's EKG, for example.

Referring now to FIG. 21E, there are illustrated additional sample one-handed gestures used in combination with voice commands and gaze signals that can be used to control the operation computing system in accordance with the present invention. At 2140, the right-hand open-hand pose in combination with a voiced “ZOOM” command while gazing at an image on a first display of the operation computer system will invoke a zoom-in process on the image of the first display until the gesture is changed. At 2142, the thumb-and-pinky-extended pose of the right hand is used in combination with a voiced “ZOOM” command while gazing in the direction of an image presented on a second display to control the computer system to zoom out on the image of the second display until the gesture changes. At 2144, a left-handed open hand pose in a sideways orientation in combination with a voiced “MOVE” command while gazing at an image on a first display of the operation computer system will invoke a rightward move operation on the image of the first display until the gesture is changed. At 2146, a right-handed open hand pose in a sideways orientation in combination with a voiced “MOVE” command while gazing at an image on a second display of the operation computer system will invoke a leftward move operation on the image of the second display until the gesture is changed. At 2148, a closed right fist in a circular clockwise motion in combination with a voiced “LOUD” command and a gaze in the direction of a graphical interface of an audio control device on a third display of the computer control system results in volume on/off control and amplitude control.
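
One way to see these combined samples is as a lookup keyed on the recognized pose, the voiced command, and the gazed-at display. The sketch below is purely illustrative; the label strings are placeholders for whatever the pose, speech, and gaze recognizers actually output.

# Illustrative lookup of the gesture + voice + gaze combinations of FIG. 21E.
combo_commands = {
    ("open_hand",            "ZOOM", "display_1"): "zoom_in_display_1",
    ("thumb_pinky_extended", "ZOOM", "display_2"): "zoom_out_display_2",
    ("left_open_sideways",   "MOVE", "display_1"): "move_image_right_display_1",
    ("right_open_sideways",  "MOVE", "display_2"): "move_image_left_display_2",
    ("closed_fist_circular", "LOUD", "display_3"): "audio_volume_control",
}

def resolve(pose, voice, gaze):
    """Return the operation for a (pose, voice, gaze) triple, or no action."""
    return combo_commands.get((pose, voice, gaze), "no_action")

print(resolve("open_hand", "ZOOM", "display_1"))   # -> zoom_in_display_1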

It is to be appreciated that numerous other combinations of hand poses, body gestures, voice commands, and gaze orientations can be employed to effect control of the medical operation environment. Only a few samples of the individual and combinatory capabilities are provided herein.

The complementary nature of speech and gesture is well established. It has been shown that when naturally gesturing during speech, people will convey different sorts of information than is conveyed by the speech. In more designed settings such as interactive systems, it may also be easier for the user to convey some information with either speech or gesture or a combination of both. For example, suppose the user has selected an object as described previously and that this object is a stereo amplifier controlled via a network connection by the host computer. Existing speech recognition systems would allow a user to control the volume by, for example, saying “up volume” a number of times until the desired volume is reached. However, while such a procedure is possible, it is likely to be more efficient and precise for the user to turn a volume knob on the amplifier. This is where the previously described gesture recognition system can come into play. Rather than having to turn a physical knob on the amplifier, the user would employ the pointer to control the volume by, for example, pointing at the stereo and rolling the pointer clockwise or counterclockwise to respectively turn the volume up or down. The latter procedure can provide the efficiency and accuracy of a physical volume knob, while at the same time providing the convenience of being able to control the volume remotely as in the case of the voice recognition control scheme. This is just one example of a situation where gesturing control is the best choice; there are others. In addition, there are many situations where using voice control would be the best choice. Still further, there are situations where a combination of speech and gesture control would be the most efficient and convenient method. Thus, a combined system that incorporates the previously described gesturing control system and a conventional speech control system would have distinct advantages over either system alone.

To this end, as indicated hereinabove, the present invention includes the integration of a conventional speech control system into the gesture control and pointer systems, which results in a simple framework for combining the outputs of various modalities such as pointing to target objects and pushing the button on the pointer, pointer gestures, and speech, to arrive at a unified interpretation that instructs a combined environmental control system on an appropriate course of action. This framework decomposes the desired action (e.g., “turn up the volume on the amplifier”) into a command (i.e., “turn up the volume”) and a referent (i.e., “the amplifier”) pair. The referent can be identified using the pointer to select an object in the environment as described previously or using a conventional speech recognition scheme, or both. The command may be specified by pressing the button on the pointer, or by a pointer gesture, or by a speech recognition event, or any combination thereof. Interfaces that allow multiple modes of input are called multimodal interfaces. With this multimodal command/referent representation, it is possible to effect the same action in multiple ways. For example, all the following pointing, speech, and gesture actions on the part of the user can be employed in the present control system to turn on a light that is under the control of the host computer:

1. Say “turn on the desk lamp”;

2. Point at the lamp with the pointer and say “turn on”;

3. Point at the lamp with the pointer and perform a “turn on” gesture using the pointer;

4. Say “desk lamp” and perform the “turn on” gesture with the pointer;

5. Say “lamp”, point toward the desk lamp with the pointer rather than other lamps in the environment such as a floor lamp, and perform the “turn on” gesture with the pointer; and

6. Point at the lamp with the pointer and press the pointer's button (assuming the default behavior when the lamp is off and the button is clicked, is to turn the lamp on).
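
The command/referent decomposition behind the enumerated examples can be sketched as follows; this is only a rough stand-in for the framework, assuming a toy vocabulary, and the phrase tables, device names, and function signature are invented for illustration. The three calls at the end correspond to ways 1, 2, and 5 above, all of which reduce to the same (command, referent) pair.

# Hypothetical phrase tables mapping recognized utterances to referents/commands.
SPEECH_REFERENTS = {"desk lamp": "desk_lamp", "turn on the desk lamp": "desk_lamp"}
SPEECH_COMMANDS = {"turn on": "turn_on", "turn on the desk lamp": "turn_on"}

def interpret(speech=None, pointed_at=None, pointer_gesture=None,
              button_pressed=False):
    """Combine speech, pointing, and gesture evidence into a (command, referent) pair."""
    referent = None
    command = None
    if speech:
        referent = SPEECH_REFERENTS.get(speech, referent)
        command = SPEECH_COMMANDS.get(speech, command)
    if pointed_at:                       # pointing disambiguates the referent
        referent = pointed_at
    if pointer_gesture == "turn_on" or button_pressed:
        command = "turn_on"
    if referent and command:
        return (command, referent)
    return None                          # incomplete; wait for more input

print(interpret(speech="turn on the desk lamp"))
print(interpret(speech="turn on", pointed_at="desk_lamp"))
print(interpret(speech="lamp", pointed_at="desk_lamp", pointer_gesture="turn_on"))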

By unifying the results of pointing, gesture recognition, and speech recognition, the overall system is made more robust. For example, a spurious speech recognition event of “volume up” while pointing at the light is ignored, rather than resulting in the volume of an amplifier being increased, as would happen if a speech control scheme were being used alone. Also, consider the example given above where the user says “lamp” while pointing toward the desk lamp with the pointer rather than other lamps in the environment, and performs the “turn on” gesture with the pointer. In that example, just saying “lamp” is ambiguous, but pointing at the desired lamp clears up the uncertainty. Thus, by including the strong contextualization provided by the pointer, the speech recognition may be made more robust.

The speech recognition system can employ a very simple command and control (CFG) style grammar, with preset utterances for the various electronic components and simple command phrases that apply to the components. The user wears a wireless lapel microphone to relay voice commands to a receiver, which is connected to the host computer and which relays the received speech commands to the speech recognition system running on the host computer.
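
The sketch below suggests what such a small command-and-control vocabulary might look like as data, with a naive phrase matcher standing in for the speech recognizer; the phrase lists are invented examples, not the grammar of the specification.

# Preset referent utterances for the controllable components, plus command phrases.
REFERENT_PHRASES = {"desk lamp": "desk_lamp", "floor lamp": "floor_lamp",
                    "amplifier": "amplifier", "stereo": "amplifier"}
COMMAND_PHRASES = {"turn on": "turn_on", "turn off": "turn_off",
                   "volume up": "volume_up", "volume down": "volume_down"}

def parse_utterance(utterance):
    """Return (command, referent); either element may be None if unmatched."""
    text = utterance.lower()
    command = next((c for phrase, c in COMMAND_PHRASES.items() if phrase in text), None)
    referent = next((r for phrase, r in REFERENT_PHRASES.items() if phrase in text), None)
    return command, referent

print(parse_utterance("turn on the desk lamp"))   # ('turn_on', 'desk_lamp')
print(parse_utterance("volume up"))               # ('volume_up', None)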

While various computational frameworks could be employed, the multimodal integration process employed in the present control system uses a dynamic Bayes network that encodes the various ways that sensor outputs may be combined to identify the intended referent and command, and initiate the proper action.

The identity of the referent, the desired command, and the appropriate action are all determined by combining the outputs of the speech recognition system, gesture recognition system, and pointing analysis processes using a dynamic Bayes network architecture. Bayes networks have a number of advantages that make them appropriate to this task. First, it is easy to break apart and treat separately dependencies that otherwise would be embedded in a very large table over all the variables of interest. Secondly, Bayes networks are adept at handling probabilistic (noisy) inputs. Further, the network represents ambiguity and incomplete information that may be used appropriately by the system. In essence, the Bayes network preserves ambiguities from one time step to the next while waiting for enough information to become available to make a decision as to what referent, command, or action is intended. It is even possible for the network to act proactively when not enough information is available to make a decision. For example, if the user doesn't point at the lamp, the system might ask which lamp is meant after the utterance “lamp”.

However, the Bayes network architecture is chosen primarily to exploit the redundancy of the user's interaction to increase confidence that the proper action is being implemented. The user may specify commands in a variety of ways, even though the designer specified only objects to be pointed to, utterances to recognize, and gestures to recognize (as well as how referents and commands combine to result in action). For example, it is natural for a person to employ deictic (pointing) gestures in conjunction with speech to relay information where the speech is consistent with and reinforces the meaning of the gesture. Thus, the user will often naturally indicate the referent and command applicable to a desired resulting action via both speech and gesturing. This most frequently includes pointing at an object the user wants to affect.

The Bayes network architecture also allows the state of various devices to be incorporated to make the interpretation more robust. For example, if the light is already on, the system may be less disposed to interpret a gesture or utterance as a “turn on” gesture or utterance. In terms of the network, the associated probability distribution over the nodes representing the light and its parents, the Action and Referent nodes, is configured so that the only admissible action when the light is on is to turn it off, and likewise when it is off the only action available is to turn it on.

Still further, the “dynamic” nature of the dynamic Bayes network can be exploited advantageously. The network is dynamic because it has a mechanism by which it maintains a short-term memory of certain values in its network. It is natural that the referent will not be determined at the exact same moment in time as the command. In other words, a user will not typically specify the referent by whatever mode (e.g., pointing and/or speech) at the same time he or she relays the desired command using one of the various methods available (e.g., pointer button push, pointer gesture, and/or speech). If the referent is identified only to be forgotten in the next instant of time, the association with a command that comes after it will be lost. The dynamic Bayes network models the likelihood of a referent or a command applying to future time steps as a dynamic process. Specifically, this is done via a temporal integration process in which probabilities assigned to referents and commands in the last time step are brought forward to the current time step and are input, along with new speech, pointing, and gesture inputs, to influence the probability distribution computed for the referents and commands in the current time step. In this way, the network tends to hold a memory of a command and referent that decays over time, and it is thus unnecessary to specify the command and referent at exactly the same moment in time. In one example, this propagation occurred four times a second.
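
The decaying-memory effect of the temporal integration can be imitated with a much simpler stand-in than the full Bayes network; the sketch below carries forward decayed confidence scores for referents and commands at each propagation step (nominally four per second), so that the referent and command need not arrive in the same step. The decay factor, threshold, evidence values, and vocabulary are all assumptions made only for this example.

DECAY = 0.7            # how much of last step's confidence carries forward (assumed)
THRESHOLD = 0.6        # confidence required before an action fires (assumed)

def propagate(previous, evidence):
    """Decay the previous confidences and add in this step's new evidence."""
    keys = set(previous) | set(evidence)
    return {k: min(1.0, DECAY * previous.get(k, 0.0) + evidence.get(k, 0.0))
            for k in keys}

referents, commands = {}, {}
timeline = [
    ({"desk_lamp": 0.9}, {}),            # the user points at the lamp first...
    ({}, {"turn_on": 0.8}),              # ...then says or gestures "turn on"
]
for ref_evidence, cmd_evidence in timeline:
    referents = propagate(referents, ref_evidence)
    commands = propagate(commands, cmd_evidence)
    best_r = max(referents, key=referents.get) if referents else None
    best_c = max(commands, key=commands.get) if commands else None
    if (best_r and best_c and referents[best_r] > THRESHOLD
            and commands[best_c] > THRESHOLD):
        print("action:", best_c, best_r)   # fires on the second step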

Note that although the previous description was centered on an operating room environment, the present invention has application in many other environments where data access and presentation are beneficial or even necessary to facilitate a man-machine interface.

What has been described above includes examples of the present invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present invention, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.

1. A storage medium storing a sound output control program for a sound output control apparatus that outputs a sound from an output means in accordance with manipulation of an operating means, wherein said operating means comprises an acceleration sensor for detecting accelerations in directions of at least two axes orthogonal to each other, said sound output control program allows a processor of said sound output control apparatus to execute: an acquisition step of acquiring the accelerations detected by said acceleration sensor; a calculation step of calculating a sum of accelerations of two axes acquired in said acquisition step; a sound control step of generating a control signal for sound output in accordance with a change in said sum of accelerations; and a sound output step of outputting a sound from said output means based on said control signal.
2. A storage medium storing a sound output control program according to claim 1, wherein said sound control step includes a determination step of determining whether a change in said sum of accelerations has formed a predetermined relationship with any of a plurality of threshold values stored in a storage means, and when it is determined in said determination step that a change in said sum of accelerations has formed a predetermined relationship with said threshold value, said control signal for sound output is generated.
3. A storage medium storing a sound output control program according to claim 1, wherein, in said sound control step, a tone of sound to be output is selected on the basis of a change in said sum of accelerations.
4. A storage medium storing a sound output control program according to claim 1, wherein said operating means further comprises a sound emission instruction means for providing an instruction on whether or not to carry out sound output and, in said sound control step, presence or absence of the sound output is controlled in accordance with the instruction from said sound emission instruction means.
5. A sound output control apparatus for outputting a sound from an output means in accordance with manipulation of an operating means having an acceleration sensor for detecting accelerations in at least two directions orthogonal to each other, comprising: an acquisition means for acquiring the accelerations detected by said acceleration sensor; a calculation means for calculating a sum of accelerations of two axes acquired by said acquisition means; a sound control means for generating a control signal for sound output in accordance with a change in said sum of accelerations; and a sound output means for outputting a sound from said output means based on said control signal.
6. A storage medium storing a volume control program for a volume control device that outputs a sound from a speaker in accordance with manipulation of a pointer, wherein said pointer comprises an acceleration sensor for detecting acceleration values in directions of at least two axes orthogonal to each other, said volume control program allows a processor of said volume control device to execute: an acquisition step of acquiring the acceleration values detected by said acceleration sensor; a calculation step of combining the acceleration values of two axes acquired in said acquisition step; a volume control step of generating a control signal for outputting sound in accordance with a change in said combination of acceleration values; and a volume output step of outputting sound from said speaker based on said control signal.
7. A storage medium storing a volume control program according to claim 6, wherein said volume control step includes a determination step of determining whether a change in said combination of acceleration values has exceeded a predetermined threshold value stored in a storage means, and when it is determined in said determination step that a change in said combination of acceleration values has exceeded said threshold value, said control signal for sound output is generated.
8. A storage medium storing a volume control program according to claim 6, wherein, in said volume control step, a tone of sound to be output is selected on the basis of a change in said combination of acceleration values.
9. A storage medium storing a volume control program according to claim 6, wherein said pointer further comprises a sound processing unit providing instructions on whether or not to output sound and, in said sound control step, presence or absence of the sound output is controlled in accordance with the instructions from said sound processing unit.
10. A volume control device for outputting a sound from a speaker in accordance with manipulation of a pointer having an acceleration sensor for detecting acceleration values in at least two directions orthogonal to each other, comprising: an acquisition processing unit acquiring the acceleration values detected by said acceleration sensor; a calculation processing unit calculating a combination of acceleration values of two axes acquired by said acquisition processing unit; a volume control processing unit generating a control signal for outputting sound in accordance with a change in said combination of acceleration values; and a speaker outputting a sound based on said control signal.
11. A game apparatus for executing a game process using acceleration data output from a multi-axis acceleration sensor included in an input device, comprising: an obtaining means for successively obtaining the acceleration data; a gravity direction calculating means for calculating a gravity direction which is represented using an orientation of the input device as a reference, based on values of a plurality of pieces of acceleration data obtained during a predetermined period; a first movement direction calculating means for calculating a first movement direction which is a movement direction of the input device which is represented using the orientation of the input device as the reference, based on the values of the plurality of pieces of acceleration data obtained during the predetermined period; and a game processing means for executing a game process based on the first movement direction and the gravity direction.

12. The game apparatus according to claim 11, further comprising: a second movement direction calculating means for calculating a second movement direction which is a movement direction of the input device with respect to the gravity direction, based on the first movement direction and the gravity direction, wherein the game processing means executes a game process based on the second movement direction.
13. The game apparatus according to claim 12, wherein the game processing means executes a game process for moving an object displayed on a display device in a direction corresponding to the second movement direction.

14. The game apparatus according to claim 11, further comprising: a period detecting means for detecting a period from start to end of movement of the input device, as the predetermined period, based on the acceleration data obtained by the obtaining means.
15. The game apparatus according to claim 11, wherein the first movement direction calculating means calculates the first movement direction from transition of values of an acceleration data group obtained during the predetermined period.
16. A computer readable storage medium storing a game program for causing a computer in a game apparatus for executing a game process using acceleration data output from a multi-axis acceleration sensor included in an input device, to execute: an obtaining step of successively obtaining the acceleration data; a gravity direction calculating step of calculating a gravity direction which is represented using an orientation of the input device as a reference, based on values of a plurality of pieces of acceleration data obtained during a predetermined period; a first movement direction calculating step of calculating a first movement direction which is a movement direction of the input device which is represented using the orientation of the input device as the reference, based on the values of the plurality of pieces of acceleration data obtained during the predetermined period; and a game processing step of executing a game process based on the first movement direction and the gravity direction.

17. The storage medium according to claim 16, wherein the game program causes the computer to further execute: a second movement direction calculating step of calculating a second movement direction which is a movement direction of the input device with respect to the gravity direction, based on the first movement direction and the gravity direction, wherein, in the game processing step, the computer executes a game process based on the second movement direction.
18. The storage medium according to claim 17, wherein, in the game processing step, the computer executes a game process for moving an object displayed on a display device in a direction corresponding to the second movement direction.
19. The storage medium according to claim 16, wherein the game program causes the computer to further execute: a period detecting step of detecting a period from start to end of movement of the input device, as the predetermined period, based on the acceleration data obtained by the obtaining step.
20. The storage medium according to claim 16, wherein, in the first movement direction calculating step, the computer calculates the first movement direction from transition of values of an acceleration data group obtained during the predetermined period.
21. A game device for executing a game process using acceleration data output from a multi-axis acceleration sensor included in a pointer, comprising: an obtaining processing unit for successively obtaining the acceleration data; a pitch angle computing processing unit for computing a pitch angle using an orientation of the pointer as a reference, based on a plurality of acceleration values obtained during a calibration period; a first movement direction computing processing unit for computing a first movement direction of the pointer using the orientation of the pointer as the reference, based on the plurality of acceleration values obtained during the calibration period; and a game processing unit for executing a game process based on the first movement direction and the pitch angle.
22. The game device according to claim 21, further comprising: a second movement direction computing processing unit for computing a second movement direction of the pointer with respect to the pitch angle, based on the first movement direction and the pitch angle, wherein the game processing unit executes a game process based on the second movement direction.
23. The game device according to claim 22, wherein the game processing unit executes a game process for moving an object displayed on a display in a direction corresponding to the second movement direction.
24. The game device according to claim 21, further comprising: a period detecting processing unit for detecting a period from start to end of movement of the pointer, as the predetermined period, based on the acceleration data obtained by the obtaining processing unit.
25. The game device according to claim 21, wherein the first movement direction computing processing unit computes the first movement direction from the acceleration data obtained during the predetermined period.
26. A computer readable storage medium storing a game program for causing a computer in a game device for executing a game process using acceleration data output from a multi-axis acceleration sensor included in a pointer, to execute: an obtaining step of successively obtaining the acceleration data; a pitch angle computing step of computing a pitch angle using an orientation of the pointer as a reference, based on a plurality of values of acceleration data obtained during a predetermined period; a first movement direction computing step of computing a first movement direction which is a movement direction of the pointer using the orientation of the pointer as the reference, based on the plurality of values of the acceleration data obtained during the predetermined period; and a game processing step of executing a game process based on the first movement direction and the pitch angle.
27. The storage medium according to claim 26, wherein the game program causes the computer to further execute: a second movement direction computing step of computing a second movement direction which is a movement direction of the pointer with respect to the pitch angle, based on the first movement direction and the pitch angle, wherein, in the game processing step, the computer executes a game process based on the second movement direction.
28. The storage medium according to claim 27, wherein, in the game processing step, the computer executes a game process for moving an object displayed on a display in a direction corresponding to the second movement direction.
29. The storage medium according to claim 26, wherein the game program causes the computer to further execute: a period detecting step of detecting a period from start to end of movement of the pointer, as the predetermined period, based on the acceleration data obtained by the obtaining step.
30. The storage medium according to claim 26, wherein, in the first movement direction computing step, the computer computes the first movement direction from the acceleration data obtained during the predetermined period.